Machine Vision and Applications, Volume 29, Issue 2, pp 207–217

Spatio-temporal elastic cuboid trajectories for efficient fight recognition using Hough forests

  • Ismael Serrano
  • Oscar Deniz
  • Gloria Bueno
  • Guillermo Garcia-Hernando
  • Tae-Kyun Kim
Original Paper


While action recognition has become an important line of research in computer vision, the recognition of particular events such as aggressive behaviors, or fights, has received comparatively little attention. Such recognition may be exceedingly useful in video surveillance scenarios such as psychiatric centers and prisons, or even on personal smartphone cameras. This potential usefulness has caused a surge of interest in developing fight or violence detectors. The key aspect in this case is efficiency: these methods should be computationally very fast. In this paper, spatio-temporal elastic cuboid trajectories are proposed for fight recognition. The method uses blob movements to create trajectories that capture and model the motions specific to a fight. The proposed method is robust to the specific shapes and positions of the individuals. Additionally, the standard Hough forests classifier is adapted for use with this descriptor. The method is compared with nine other related methods on four datasets. The results show that the proposed method obtains the best accuracy on each dataset and is also computationally efficient.
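As a rough illustration of the general idea of building trajectories from blob movements, the sketch below thresholds the difference between consecutive frames, groups the changed pixels into connected blobs, and links blob centroids over time. It is not the authors' descriptor: the function names, the thresholds (`thresh`, `min_area`, `max_dist`) and the greedy nearest-neighbour linking are all assumptions made for this example, and the elastic cuboids and the Hough forests classifier are not covered.

```python
from collections import deque

def motion_mask(prev, curr, thresh=30):
    """Mark pixels whose intensity changed by more than thresh (assumed value)."""
    return [[abs(c - p) > thresh for p, c in zip(rp, rc)]
            for rp, rc in zip(prev, curr)]

def blob_centroids(mask, min_area=2):
    """Find 4-connected blobs in the mask and return their (row, col) centroids."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    centroids = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects the pixels of one blob.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:  # ignore tiny blobs (assumed noise filter)
                n = len(pixels)
                centroids.append((sum(p[0] for p in pixels) / n,
                                  sum(p[1] for p in pixels) / n))
    return centroids

def link_trajectories(centroids_per_step, max_dist=5.0):
    """Greedily extend each trajectory with the nearest unused centroid."""
    trajectories = [[c] for c in centroids_per_step[0]] if centroids_per_step else []
    for centroids in centroids_per_step[1:]:
        free = list(centroids)
        for traj in trajectories:
            if not free:
                break
            last = traj[-1]
            best = min(free, key=lambda c: (c[0] - last[0]) ** 2 + (c[1] - last[1]) ** 2)
            if (best[0] - last[0]) ** 2 + (best[1] - last[1]) ** 2 <= max_dist ** 2:
                traj.append(best)
                free.remove(best)
        # Centroids that matched no trajectory start new ones.
        trajectories.extend([c] for c in free)
    return trajectories

def make_frame(x, size=8):
    """Synthetic frame: a bright 2x2 square at rows 3-4, columns x..x+1."""
    return [[255 if 3 <= r <= 4 and x <= c <= x + 1 else 0 for c in range(size)]
            for r in range(size)]

# The square moves two pixels to the right in each frame.
frames = [make_frame(x) for x in (0, 2, 4, 6)]
steps = [blob_centroids(motion_mask(a, b)) for a, b in zip(frames, frames[1:])]
trajectories = link_trajectories(steps)
print(trajectories)
```

On this synthetic clip, each frame difference yields a single blob covering the old and new positions of the square, so the linked centroids form one trajectory that drifts steadily to the right. A practical system would more likely rely on an optimized connected-components routine than on a pure-Python flood fill.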


Keywords

Violence recognition, Fight recognition, Descriptor, Blobs, Video sequences, Hough forests



This work has been partially supported by Project TIN2011-24367 from Spain’s Ministry of Economy and Competitiveness.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. VISILAB group, University of Castilla-La Mancha, Ciudad Real, Spain
  2. Department of Electrical and Electronic Engineering, Imperial College London, London, UK
