Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop

  • 1229: Multimedia Data Analysis for Smart City Environment Safety
Multimedia Tools and Applications

Abstract

Violence can happen anywhere. One way to monitor it is to install closed-circuit television (CCTV), and the recorded footage can serve as evidence in court. Violence video classification is also an active topic in deep learning. The latest violence video dataset is RWF-2000, which contains 2000 violent and non-violent videos, each 5 seconds long at 30 frames per second. The publication that introduced the dataset reports a best accuracy of 87.25% for its proposed method. In this study, we use a residual network, which is known to mitigate the vanishing gradient problem. In addition, we apply transfer learning from models pre-trained on Kinetics and on Kinetics + Moments in Time. We also vary the number of sampled frames and the range of frame positions from which they are drawn. The RGB and optical flow inputs are trained separately with different configurations. The best accuracy for the RGB input is 89.25%, obtained with Kinetics + Moments in Time pre-training and frames sampled from positions 49-149. The best accuracy for the optical flow input is 88.5%, obtained with Kinetics pre-training and 74 frames. Summing the outputs of the two streams yields an accuracy of 90.5%.
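The two ideas in the abstract that lend themselves to a short illustration are sampling frames from a restricted range of the clip and fusing the two streams by summing their outputs. The sketch below is not the authors' implementation. A 5-second clip at 30 frames per second has 150 frames, so the range 49-149 lies inside one clip. Everything else is an assumption: the function names, the even spacing of the sampled frames, the frame count of 16, and the score values are hypothetical. The abstract does not say whether the summed outputs are raw scores or probabilities, so the code simply adds whatever per-class scores it is given.

```python
def sample_frame_indices(start, end, num_frames):
    """Return num_frames evenly spaced frame indices from the inclusive range [start, end].

    Even spacing is an assumption made for illustration; the abstract only
    states that the sampled frame range and the number of frames were varied.
    """
    if num_frames < 1 or end < start:
        raise ValueError("need num_frames >= 1 and end >= start")
    if num_frames == 1:
        return [start]
    step = (end - start) / (num_frames - 1)
    return [round(start + i * step) for i in range(num_frames)]


def fuse_by_sum(rgb_scores, flow_scores):
    """Late fusion: add the per-class scores of both streams, then take the argmax."""
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same set of classes")
    fused = [r + f for r, f in zip(rgb_scores, flow_scores)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Hypothetical usage: 16 frames drawn from positions 49-149 of a 150-frame clip.
indices = sample_frame_indices(49, 149, 16)

# Hypothetical per-class scores (class 0 = non-violent, class 1 = violent).
rgb_scores = [0.2, 0.8]
flow_scores = [0.7, 0.3]
fused_scores, label = fuse_by_sum(rgb_scores, flow_scores)
print(indices)
print(fused_scores, label)
```

In this made-up example the two streams disagree when taken alone, and the summed scores settle the decision. Sum fusion needs no extra trainable parameters, which is one plausible reason to try it first.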

Availability of data and materials

The datasets generated during and/or analysed during the current study are available in the RWF-2000 repository, https://github.com/mchengny/RWF2000-Video-Database-for-Violence-Detection


Author information

Corresponding author

Correspondence to Novanto Yudistira.

Ethics declarations

Conflicts of interest/Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

All authors contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pratama, R.A., Yudistira, N. & Bachtiar, F.A. Violence recognition on videos using two-stream 3D CNN with custom spatiotemporal crop. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-15599-0

  • DOI: https://doi.org/10.1007/s11042-023-15599-0
