Multimedia Tools and Applications, Volume 76, Issue 5, pp 7041–7065

Evaluation of multiple features for violent scenes detection

  • Vu Lam
  • Sang Phan
  • Duy-Dinh Le
  • Duc Anh Duong
  • Shin’ichi Satoh


Violent scenes detection (VSD) is a challenging problem because of the heterogeneous content, large variations in video quality, and complex semantic meanings of the concepts involved. In the last few years, combining multiple features from multiple modalities has proven to be an effective strategy for general multimedia event detection (MED), but the detection of specific events such as violent scenes has been comparatively less studied. Here, we evaluated the use of multiple features and their combination in a violent scenes detection system. We rigorously analyzed a set of low-level features and a deep learning feature that together capture appearance, color, texture, motion, and audio in video. We also evaluated the utility of mid-level visual information obtained by detecting concepts related to violence. Experiments were performed on the publicly available MediaEval VSD 2014 dataset. The results showed that visual and motion features performed better than audio features. Moreover, the performance of the mid-level features was nearly as good as that of the low-level visual features. Experiments with a number of fusion methods showed that the single features are complementary and that combining them improves overall performance. This study also provides an empirical foundation for selecting feature sets capable of dealing with the heterogeneous content of violent scenes in movies.
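The abstract does not say which fusion methods were evaluated, so the following is only a minimal sketch of one common option, score-level (late) fusion, and should not be read as the authors' method. It assumes each feature (for example visual, motion, audio) has its own classifier that outputs one score per video segment; the function names, feature names, and score values are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so different classifiers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no ranking information, map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(scores_by_feature: Dict[str, List[float]],
                weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted average of normalized per-feature scores (hypothetical helper).

    scores_by_feature maps a feature name to one score per video segment.
    weights maps a feature name to a non-negative weight; equal weights
    are used when it is omitted.
    """
    if not scores_by_feature:
        raise ValueError("at least one feature is required")
    names = list(scores_by_feature)
    n = len(scores_by_feature[names[0]])
    if any(len(scores_by_feature[k]) != n for k in names):
        raise ValueError("all features must score the same segments")
    if n == 0:
        return []
    if weights is None:
        weights = {k: 1.0 for k in names}
    total = sum(weights[k] for k in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n
    for k in names:
        normalized = min_max_normalize(scores_by_feature[k])
        for i, s in enumerate(normalized):
            fused[i] += (weights[k] / total) * s
    return fused


# Illustrative, made-up scores for three segments (not results from the paper).
example_scores = {
    "visual": [0.9, 0.2, 0.6],
    "motion": [2.0, 0.0, 1.0],
    "audio": [0.1, 0.3, 0.2],
}
fused_scores = late_fusion(example_scores)
```

With equal weights, a segment ranks high only when several features agree, which is one way complementary features can improve on any single one. In practice the weights would be tuned on held-out data.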


Keywords: Violent scenes detection, Video retrieval, Multi-modal fusion, Multiple features



This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Vu Lam
    • 1
  • Sang Phan
    • 2
  • Duy-Dinh Le
    • 2
  • Duc Anh Duong
    • 3
  • Shin’ichi Satoh
    • 2
  1. University of Science, VNU-HCMC, Ho Chi Minh, Vietnam
  2. National Institute of Informatics, Tokyo, Japan
  3. University of Information Technology, VNU-HCMC, Ho Chi Minh, Vietnam
