International Journal of Computer Vision, Volume 107, Issue 3, pp 219–238

Activity representation with motion hierarchies

  • Adrien Gaidon
  • Zaid Harchaoui
  • Cordelia Schmid


Complex activities, e.g. pole vaulting, are composed of a variable number of sub-events connected by complex spatio-temporal relations, whereas simple actions can be represented as sequences of short temporal parts. In this paper, we learn hierarchical representations of activity videos in an unsupervised manner. These hierarchies of mid-level motion components are data-driven decompositions specific to each video. We introduce a spectral divisive clustering algorithm to efficiently extract a hierarchy over a large number of tracklets (i.e. local trajectories). We use this structure to represent a video as an unordered binary tree. We model this tree using nested histograms of local motion features. We provide an efficient positive definite kernel that computes the structural and visual similarity of two hierarchical decompositions by relying on models of their parent–child relations. We present experimental results on four recent challenging benchmarks: the High Five dataset (Patron-Perez et al., High five: recognising human interactions in TV shows, 2010), the Olympics Sports dataset (Niebles et al., Modeling temporal structure of decomposable motion segments for activity classification, 2010), the Hollywood 2 dataset (Marszalek et al., Actions in context, 2009), and the HMDB dataset (Kuehne et al., HMDB: A large video database for human motion recognition, 2011). We show that per-video hierarchies provide additional information for activity recognition. Our approach improves over unstructured activity models, baselines using other motion decomposition algorithms, and the state of the art.
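As a rough illustration of the divisive idea summarized above, the following minimal sketch recursively bipartitions a set of points and stores a histogram of quantized features at every node, so that each parent histogram is the sum of its children's. It is not the paper's algorithm: the toy 2D points standing in for tracklets, the Gaussian affinity, the sign-of-Fiedler-vector split, the random "visual words", and the names `fiedler_split` and `build_tree` are all assumptions made for this example.

```python
import numpy as np


def fiedler_split(points, sigma=1.0):
    """Bipartition points by the sign of the Fiedler vector of a Gaussian affinity graph."""
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    laplacian = np.eye(len(points)) - inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    # Second-smallest eigenvector, rescaled to the generalized eigenproblem.
    return eigvecs[:, 1] * inv_sqrt_deg >= 0


def build_tree(points, words, indices, vocab_size, min_size=4, sigma=1.0):
    """Recursively split `indices`; every node keeps a histogram of its members' words."""
    node = {
        "size": len(indices),
        "hist": np.bincount(words[indices], minlength=vocab_size),
        "children": [],
    }
    # Only nodes with at least 2 * min_size members are split further.
    if len(indices) >= 2 * min_size:
        mask = fiedler_split(points[indices], sigma)
        if mask.any() and not mask.all():  # skip degenerate splits
            node["children"] = [
                build_tree(points, words, indices[part], vocab_size, min_size, sigma)
                for part in (mask, ~mask)
            ]
    return node


# Toy data (assumed): two well-separated blobs of 10 points, random words in [0, 8).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
words = rng.integers(0, 8, size=20)
tree = build_tree(points, words, np.arange(20), vocab_size=8)
```

The resulting `tree` is an unordered binary tree of nested histograms; comparing two such trees would additionally require a kernel over their parent–child pairs, which this sketch does not attempt.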


Keywords: Action recognition, Video analysis, Motion decomposition, Spectral clustering, Kernel methods



This work was partially funded by the MSR/INRIA joint project, the European integrated project AXES, the PASCAL 2 Network of Excellence, the Gargantua project under program Mastodons of CNRS, the LabEx PERSYVAL-Lab (ANR-11-LABX-0025), and the ERC advanced grant ALLEGRO.


  1. Bilen, H., Namboodiri, V.P., & Van Gool, L.J. (2011). Object and action classification with latent variables. In BMVC, Bristol.
  2. Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. Sebastopol: O’Reilly Media.
  3. Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In ICCV, Boston.
  4. Brox, T., & Malik, J. (2010). Object segmentation by long term analysis of point trajectories. In ECCV, Carlton.
  5. Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In ECCV, Berlin.
  6. De Castro, E., & Morandi, C. (1987). Registration of translated and rotated images using finite Fourier transforms. In PAMI, Portage la Prairie.
  7. Diestel, R. (2005). Graph theory. Heidelberg: Springer.
  8. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley.
  9. Dupé, F., & Brun, L. (2008). Hierarchical bag of paths for kernel based shape classification. In Structural, Syntactic, and Statistical Pattern Recognition, Windsor.
  10. Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In Image Analysis, Lowa.
  11. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. In PAMI, Portage la Prairie.
  12. Foster, L., Waagen, A., Aijaz, N., Hurley, M., Luis, A., Rinsky, J., Satyavolu, C., Way, M.J., Gazis, P., & Srivastava, A. (2009). Stable and efficient gaussian process calculations. In JMLR, Las Vegas.
  13. Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nystrom method. In PAMI, London.
  14. Fradet, M., Robert, P., & Pérez, P. (2009). Clustering point trajectories with various life-spans. In CVMP, London.
  15. Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. In CVPR, Providence.
  16. Gaidon, A., Harchaoui, Z., & Schmid, C. (2012). Recognizing activities with cluster-trees of tracklets. In BMVC, Bristol.
  17. Gilbert, A., Illingworth, J., & Bowden, R. (2010). Action recognition using mined hierarchical compound features. In PAMI, Portage la Prairie.
  18. Grundmann, M., Meier, F., & Essa, I. (2008). 3D shape context and distance transform for action recognition. In ICPR, Delhi.
  19. Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning (2nd edition). New York: Springer.
  20. Hongeng, S., & Nevatia, R. (2003). Large-scale event detection using semi-hidden markov models. In ICCV, Boston.
  21. Ikizler-Cinbis, N., & Sclaroff, S. (2010). Object, scene and actions: Combining multiple features for human action recognition. In ECCV, Carlton.
  22. Jiang, Y., Dai, Q., Xue, X., Liu, W., & Ngo, C. (2012a). Trajectory-based modeling of human actions with motion reference points. In ECCV, Carlton.
  23. Jiang, Z., Lin, Z., & Davis, L. (2012b). Recognizing human actions by learning and matching shape-motion prototype trees. In PAMI, Portage la Prairie.
  24. Kliper-Gross, O., Gurovich, Y., Hassner, T., & Wolf, L. (2012). Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Carlton.
  25. Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, Providence.
  26. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: a large video database for human motion recognition. In ICCV, New York.
  27. Laptev, I. (2005). On space-time interest points. In IJCV, Rosario.
  28. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In CVPR, Providence.
  29. Laxton, B., Lim, J., & Kriegman, D. (2007). Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In CVPR, Providence.
  30. Lezama, J., Alahari, K., Sivic, J., & Laptev, I. (2011). Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR, Providence.
  31. Liu, J., Kuipers, B., & Savarese, S. (2011). Recognizing human actions by attributes. In CVPR, Providence.
  32. Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. In IJCV, Ho Chi Minh.
  33. Maji, S., Berg, A., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR, Providence.
  34. Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In CVPR, Providence.
  35. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In ECCV, Carlton.
  36. Mikolajczyk, K., & Uemura, H. (2008). Action recognition with motion-appearance vocabulary forest. In CVPR, Providence.
  37. Niebles, J.C., & Fei-Fei, L. (2007). Hierarchical model of shape and appearance for human action classification. In CVPR, Providence.
  38. Niebles, J.C., Chen, C., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Carlton.
  39. Oliver, N.M., Rosario, B., & Pentland, A.P. (2000). A Bayesian computer vision system for modeling human interactions. In PAMI, Portage la Prairie.
  40. Arbeláez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. In PAMI, Portage la Prairie.
  41. Patron-Perez, A., Marszalek, M., Zisserman, A., & Reid, I.D. (2010). High five: Recognising human interactions in TV shows. In BMVC, Bristol.
  42. Prest, A., Ferrari, V., & Schmid, C. (2012). Explicit modeling of human-object interactions in realistic videos. In PAMI, Portage la Prairie.
  43. Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In CVPR, Providence.
  44. Reddy, K. K., Liu, J., & Shah, M. (2009). Incremental action recognition using feature-tree. In CVPR, Providence.
  45. Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR, New York.
  46. Sapienza, M., Cuzzolin, F., & Torr, P. (2012). Learning discriminative space-time actions from weakly labelled videos. In BMVC, Bristol.
  47. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
  48. Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
  49. Sculley, D. (2010). Web-scale k-means clustering. In WWW, New York.
  50. Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
  51. Shi, J., & Malik, J. (1998). Motion segmentation and tracking using normalized cuts. In ICCV, IEEE, Beijing.
  52. Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. In PAMI, London.
  53. Shi, J., & Tomasi, C. (1994). Good features to track. In CVPR, Providence.
  54. Suard, F., Rakotomamonjy, A., & Bensrhair, A. (2007). Kernel on bag of paths for measuring similarity of shapes. In European Symposium on Artificial Neural Networks, pp 1–6.
  55. Szeliski, R. (2010). Computer vision: Algorithms and applications. New York: Springer.
  56. Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In CVPR, Providence.
  57. Todorovic, S. (2012). Human activities as stochastic kronecker graphs. In ECCV, Carlton.
  58. Vig, E., Dorr, M., & Cox, D. (2012). Space-variant descriptor sampling for action recognition based on saliency and eye movements. In ECCV, Carlton.
  59. Wang, H., Kläser, A., Schmid, C., & Liu, C.-L. (2013). Dense trajectories and motion boundary descriptors for action recognition. In IJCV, Dublin.
  60. Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic versus max-margin. In PAMI.
  61. Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS, Allahabad.
  62. Yu, G., Yuan, J., & Liu, Z. (2012). Propagative hough voting for human activity recognition. In ECCV, New York.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Adrien Gaidon (1)
  • Zaid Harchaoui (2)
  • Cordelia Schmid (2)

  1. Xerox Research Center Europe, Meylan, France
  2. LEAR Team, INRIA Grenoble Rhône-Alpes, Montbonnot, France
