Spatio-Temporal Phrases for Activity Recognition

  • Yimeng Zhang
  • Xiaoming Liu
  • Ming-Ching Chang
  • Weina Ge
  • Tsuhan Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7574)


Local-feature-based approaches have become popular for activity recognition. A local feature captures the movement and appearance of a small region in a video, and can therefore be ambiguous; e.g., when the camera is far away from a person, it cannot tell whether a movement comes from the person’s hand or foot. To better distinguish different types of activities, researchers have proposed combining local features to encode the relationships among local movements. Because of computational cost, previous work forms combinations only from features that neighbor each other in space and/or time. In this paper, we propose an approach that efficiently identifies both local and long-range motion interactions. Taking the “push” activity as an example, our approach can capture the combination of the hand movement of one person and the foot response of another person, even though the corresponding local features are far apart both spatially and temporally. The computational complexity of our approach is linear in the number of local features in a video. Extensive experiments show that, compared with a number of state-of-the-art methods, our approach is generally effective for recognizing a wide variety of activities as well as activities that span a long duration.
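The abstract does not say how long-range combinations are found in linear time. As a purely illustrative sketch, and not the authors' stated algorithm, one assumed way to do it is to vote in a quantized spatio-temporal offset space: features matched between two videos that share the same offset preserve their relative layout, so they can be counted as co-occurring pairs with a single pass and a hash map. The function name, the `matches` input format, and the `bin_size` values below are hypothetical.

```python
from collections import Counter
from math import comb


def count_cooccurring_pairs(matches, bin_size=(20, 20, 10)):
    """Count pairs of matches that share a quantized (dx, dy, dt) offset.

    `matches` is an iterable of ((xa, ya, ta), (xb, yb, tb)) tuples: the
    positions of two local features, one in video A and one in video B,
    that were assigned the same visual word (an assumed preprocessing step).
    The loop visits each match once, so the cost is linear in len(matches).
    """
    votes = Counter()
    for (xa, ya, ta), (xb, yb, tb) in matches:
        # Quantize the offset so that small misalignments fall in one bin.
        offset = (
            (xb - xa) // bin_size[0],
            (yb - ya) // bin_size[1],
            (tb - ta) // bin_size[2],
        )
        votes[offset] += 1
    # n matches in the same bin contribute C(n, 2) co-occurring pairs,
    # regardless of how far apart the features are within each video.
    return sum(comb(n, 2) for n in votes.values())


# Hypothetical example: the first two matches are far apart within each
# video (e.g. a hand and a foot) but share the same offset between videos.
example_matches = [
    ((0, 0, 0), (5, 5, 2)),
    ((100, 50, 40), (105, 55, 42)),
    ((10, 10, 10), (90, 90, 90)),
]
pair_count = count_cooccurring_pairs(example_matches)
```

In this sketch no pair of features is ever enumerated explicitly, which is what keeps the cost linear in the number of matches rather than quadratic.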


Keywords: Activity Recognition, Spatio-Temporal Phrases





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yimeng Zhang (1)
  • Xiaoming Liu (2)
  • Ming-Ching Chang (2)
  • Weina Ge (2)
  • Tsuhan Chen (1)
  1. School of Electrical and Computer Engineering, Cornell University, USA
  2. GE Global Research Center, 1 Research Circle, Niskayuna, USA
