Detecting Violent Content in Hollywood Movies and User-Generated Videos

  • Esra Acar
  • Melanie Irrgang
  • Dominique Maniry
  • Frank Hopfgartner
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Detecting violent scenes in videos is an important content understanding functionality, e.g., for providing automated youth protection services. The key issues in designing violence detection algorithms are the choice of discriminative features and the learning of effective models. We employ low- and mid-level audio-visual features and evaluate their discriminative power within the context of the MediaEval Violent Scenes Detection (VSD) task, fusing the audio-visual cues at the decision level. As audio features we use Mel-frequency cepstral coefficients (MFCC); as visual features we use dense histograms of oriented gradients (HoG), histograms of oriented optical flow (HoF), Violent Flows (ViF), and affect-related color descriptors. We partition the feature space of the violent training samples through k-means clustering and train a separate model, a two-class support vector machine (SVM), for each cluster. These models are then used to predict the violence level of videos. Experimental results on Hollywood movies and short web videos show that mid-level audio features are more discriminative than the visual features, and that performance is further enhanced by fusing the audio-visual cues at the decision level.
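The pipeline in the abstract, partitioning the violent training samples with k-means, fitting one two-class SVM per cluster, and fusing modality scores at the decision level, can be sketched as follows. This is a minimal illustration using scikit-learn with random stand-in data; the actual chapter uses MFCC audio descriptors and HoG/HoF/ViF visual descriptors, and the fusion weights here are illustrative, not the authors' tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for per-segment audio (e.g., MFCC-based) and visual
# (e.g., HoG/HoF/ViF-based) descriptors; 1 = violent, 0 = non-violent.
X_audio = rng.normal(size=(200, 20))
X_visual = rng.normal(size=(200, 32))
y = (X_audio[:, 0] + X_visual[:, 0] > 0).astype(int)

def train_per_cluster_svms(X, y, n_clusters=3):
    """Partition the violent samples with k-means, then fit one two-class
    SVM per cluster (that cluster's violent samples vs. all non-violent)."""
    X_pos_all, X_neg = X[y == 1], X[y == 0]
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(X_pos_all)
    models = []
    for c in range(n_clusters):
        X_pos = X_pos_all[cluster_ids == c]
        Xc = np.vstack([X_pos, X_neg])
        yc = np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg))]
        models.append(SVC(probability=True, random_state=0).fit(Xc, yc))
    return models

def violence_score(models, X):
    """Violence confidence: max probability over cluster-specific models."""
    return np.max([m.predict_proba(X)[:, 1] for m in models], axis=0)

audio_models = train_per_cluster_svms(X_audio, y)
visual_models = train_per_cluster_svms(X_visual, y)

# Decision-level (late) fusion: weighted average of the modality scores.
fused = 0.6 * violence_score(audio_models, X_audio) \
      + 0.4 * violence_score(visual_models, X_visual)
pred = (fused >= 0.5).astype(int)
print(f"training accuracy: {(pred == y).mean():.2f}")
```

Training one SVM per violence cluster lets each model specialize in one "type" of violence (e.g., fights vs. explosions) instead of forcing a single decision boundary over a very heterogeneous positive class.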


Keywords: Sparse code · Video segment · Video shot · Audio feature · MFCC feature



The research leading to these results has received funding from the European Community FP7 under grant agreement number 261743 (NoE VideoSense). We would like to thank Technicolor for providing the ground truth, video shot boundaries, and the corresponding keyframes which have been used in this work. Our thanks also go to Fudan University and Vietnam University of Science for providing the ground truth of the Web video dataset.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Esra Acar (1)
  • Melanie Irrgang (1)
  • Dominique Maniry (1)
  • Frank Hopfgartner (1)

  1. Technische Universität Berlin, Berlin, Germany
