International Journal of Computer Vision, Volume 126, Issue 2–4, pp. 314–332

Space-Time Tree Ensemble for Action Recognition and Localization

  • Shugao Ma
  • Jianming Zhang
  • Stan Sclaroff
  • Nazli Ikizler-Cinbis
  • Leonid Sigal


Abstract

Human actions are, inherently, structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovering frequent and discriminative tree structures is challenging due to the exponential search space, particularly if one allows partial matching. We address this by first building a concise action-word vocabulary via discriminative clustering of hierarchical space-time segments, a two-level video representation that captures both static and non-static space-time segments relevant to the action. Using this vocabulary, we then apply tree mining, followed by tree clustering and ranking, to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone or in combination with shorter patterns (action words and pairwise patterns), achieve promising performance on three challenging datasets: UCF Sports, HighFive and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF Sports to recognize and localize similar actions in JHMDB. The results demonstrate the potential for cross-dataset generalization of the trees our approach discovers.
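The pipeline summarized above (build an action-word vocabulary, mine candidate structures, then rank them by discriminativeness) can be illustrated with a toy sketch. This is not the paper's method: the data, the `rank_patterns` helper, and the frequency-based score are hypothetical stand-ins, with flat word combinations replacing actual tree mining and discriminative clustering.

```python
from itertools import combinations

# Toy stand-in data: each video is represented as the set of
# action-word IDs (vocabulary indices) observed in it. Positives
# contain the target action; negatives do not. All values are
# hypothetical.
pos_videos = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}]
neg_videos = [{2, 5, 6}, {5, 6, 7}, {2, 6, 7}]

def support(pattern, videos):
    """Fraction of videos that contain every word in the pattern."""
    s = set(pattern)
    return sum(s <= v for v in videos) / len(videos)

def rank_patterns(pos, neg, max_size=2, min_pos_support=0.5):
    """Enumerate small word combinations (a flat stand-in for tree
    mining) and rank them by a simple discriminativeness score:
    support among positive videos minus support among negatives."""
    vocab = sorted(set().union(*pos, *neg))
    scored = []
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            sp = support(combo, pos)
            if sp < min_pos_support:
                continue  # keep only frequent patterns
            scored.append((sp - support(combo, neg), combo))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored

if __name__ == "__main__":
    for score, pattern in rank_patterns(pos_videos, neg_videos)[:3]:
        print(pattern, round(score, 2))
```

In this toy example, word 1 appears in every positive video and no negative one, so the singleton pattern containing it ranks first; the paper's actual approach ranks mined tree patterns by discriminative power in an analogous frequent-then-discriminative fashion.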


Keywords: Action recognition · Action localization · Space-time tree structure



This work was supported in part through a Google Faculty Research Award and by US NSF grants 0855065, 0910908, and 1029430.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Computer Science, Boston University, Boston, USA
  2. Adobe Research, San Jose, USA
  3. Computer Engineering, Hacettepe University, Ankara, Turkey
  4. Disney Research, Pittsburgh, USA
