Spatio-temporal elastic cuboid trajectories for efficient fight recognition using Hough forests


While action recognition has become an important line of research in computer vision, the recognition of particular events such as aggressive behaviors, or fights, has received comparatively little attention. Recognizing such events can be very useful in video surveillance scenarios such as psychiatric centers and prisons, or even on personal smartphone cameras. This potential usefulness has prompted a surge of interest in developing fight or violence detectors. The key requirement in this setting is efficiency: these methods must be computationally very fast. In this paper, spatio-temporal elastic cuboid trajectories are proposed for fight recognition. The method uses blob movements to create trajectories that capture and model the different motions specific to a fight, and it is robust to the specific shapes and positions of the individuals. Additionally, the standard Hough forests classifier is adapted for use with this descriptor. The method is compared with nine other related methods on four datasets. The results show that the proposed method obtains the best accuracy on each dataset and is also computationally efficient.






This work has been partially supported by Project TIN2011-24367 from Spain’s Ministry of Economy and Competitiveness.

Author information



Corresponding author

Correspondence to Ismael Serrano.


Cite this article

Serrano, I., Deniz, O., Bueno, G. et al. Spatio-temporal elastic cuboid trajectories for efficient fight recognition using Hough forests. Machine Vision and Applications 29, 207–217 (2018).



Keywords

  • Violence recognition
  • Fight recognition
  • Descriptor
  • Blobs
  • Video sequences
  • Hough forests