
Space-Time Tree Ensemble for Action Recognition and Localization

International Journal of Computer Vision


Human actions are inherently structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovering frequent and discriminative tree structures is challenging because the search space is exponential, particularly if partial matching is allowed. We address this by first building a concise action word vocabulary via discriminative clustering of hierarchical space-time segments, a two-level video representation that captures both the static and non-static relevant space-time segments of a video. Using this vocabulary, we then apply tree mining, followed by tree clustering and ranking, to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone or in combination with shorter patterns (action words and pairwise patterns), achieve promising performance on three challenging datasets: UCF Sports, HighFive and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF Sports to recognize and localize similar actions in JHMDB. The results demonstrate the potential for cross-dataset generalization of the trees our approach discovers.
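
As a rough illustration of the final selection step described in the abstract, the sketch below ranks candidate patterns by how much more often they occur in videos of one action class than in the remaining videos. It is not the authors' algorithm: it assumes each video has already been reduced to a set of pairwise (parent word, child word) patterns, skips tree mining and tree clustering entirely, and uses a simple support difference as the ranking score. The function name `rank_patterns`, the parameters `min_support` and `top_k`, and the toy data are all hypothetical.

```python
from collections import Counter

def rank_patterns(videos, labels, target, min_support=0.5, top_k=3):
    """Rank patterns by support in `target` videos minus support elsewhere.

    Assumption: each video is a set of (parent_word, child_word) pairs.
    This is a simplified stand-in for mined tree patterns.
    """
    pos = [v for v, y in zip(videos, labels) if y == target]
    neg = [v for v, y in zip(videos, labels) if y != target]

    # Count in how many videos of each group a pattern appears.
    pos_count = Counter(p for v in pos for p in set(v))
    neg_count = Counter(p for v in neg for p in set(v))

    scored = []
    for pattern, count in pos_count.items():
        pos_support = count / len(pos)
        if pos_support < min_support:
            continue  # keep only patterns frequent in the target class
        neg_support = neg_count[pattern] / len(neg) if neg else 0.0
        scored.append((pos_support - neg_support, pattern))

    # Highest score first; ties are broken by pattern for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pattern for _, pattern in scored[:top_k]]

# Toy data (hypothetical action words, not taken from any dataset).
videos = [
    {("run", "kick"), ("kick", "fall")},
    {("run", "kick"), ("run", "jump")},
    {("run", "jump"), ("jump", "land")},
    {("run", "jump"), ("walk", "wave")},
]
labels = ["kick", "kick", "jump", "jump"]

print(rank_patterns(videos, labels, "kick", top_k=2))
```

In this toy setting the pattern shared by both "kick" videos and absent from the "jump" videos ranks first, which is the intuition behind selecting patterns that are both frequent and discriminative.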




  1. To avoid notation clutter, we omit the action class label \(a\) from \(\mathcal {T}\), \(\mathbf {w}\), \(\Phi \), \(\phi \) and \(\varphi \).

  2. Note that we use notation \(\mathcal {T}\) to denote discovered tree structures of human actions, and notation \(\mathbf {T}\) to denote image segment trees from video frame hierarchical segmentation.

  3. We did not find previous work that reports action classification and localization results for these individual action classes, so no comparison is given.




This work was supported in part through a Google Faculty Research Award and by US NSF grants 0855065, 0910908, and 1029430.

Author information



Corresponding author

Correspondence to Shugao Ma.

Additional information

Communicated by Ivan Laptev and Cordelia Schmid.


About this article


Cite this article

Ma, S., Zhang, J., Sclaroff, S. et al. Space-Time Tree Ensemble for Action Recognition and Localization. Int J Comput Vis 126, 314–332 (2018).
