International Journal of Computer Vision, Volume 82, Issue 1, pp 1–24

Semantic Representation and Recognition of Continued and Recursive Human Activities



Abstract

This paper describes a methodology for the automated recognition of complex human activities. We propose a general framework that reliably recognizes high-level human actions and human-human interactions. Our approach is description-based: it enables a user to encode the structure of a high-level human activity as a formal representation, and activities are recognized by semantically matching the constructed representations against actual observations. The methodology uses a context-free grammar (CFG) based representation scheme as a formal syntax for representing composite activities, which lets us define complex human activities in terms of simpler activities or movements. Our system takes advantage of both statistical recognition techniques from computer vision and knowledge representation concepts from traditional artificial intelligence. At the low level of the system, image sequences are processed to extract poses and gestures. Building on the recognized gestures, the high level of the system hierarchically recognizes composite actions and interactions occurring in a sequence of image frames. The concept of hallucinations and a probabilistic semantic-level recognition algorithm are introduced to cope with imperfect lower layers. As a result, the system recognizes human activities such as 'fighting' and 'assault', high-level activities that previous systems had difficulty recognizing. The experimental results show that our system reliably recognizes sequences of complex human activities with a high recognition rate.
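Since the abstract only sketches how the CFG-based representation composes activities, the following minimal Python sketch illustrates the general idea of hierarchical, rule-based matching: composite activities are defined as productions over simpler sub-events with temporal constraints, and recognition expands composite sub-events recursively before matching. The rule format, the activity names ('approach', 'punch', 'assault', 'fighting'), and the matching strategy are illustrative assumptions for exposition, not the paper's actual representation language.

```python
# Minimal sketch of hierarchical, rule-based activity matching.
# The rule format and activity names are hypothetical, not the paper's syntax.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Event:
    label: str   # gesture or (sub-)activity name
    start: int   # start frame
    end: int     # end frame

def before(a, b):
    # Allen-style 'before' relation: a finishes no later than b starts
    return a.end <= b.start

# Each production: composite name -> (required sub-event labels,
# temporal constraint over the matched sub-events)
RULES = {
    "assault":  (("approach", "punch"),  lambda a, b: before(a, b)),
    "fighting": (("assault", "assault"), lambda a, b: a != b and before(a, b)),
}

def recognize(name, events):
    """Match one composite rule against observed events, recursively
    recognizing any composite sub-events first (hierarchical recognition)."""
    parts, constraint = RULES[name]
    pool = list(events)
    for part in set(parts):
        if part in RULES:                  # sub-event is itself composite
            pool += recognize(part, events)
    found = []
    candidates = [[e for e in pool if e.label == part] for part in parts]
    for combo in product(*candidates):
        if constraint(*combo):
            found.append(Event(name,
                               min(e.start for e in combo),
                               max(e.end for e in combo)))
    return found

# Toy gesture stream from an imagined low-level layer: two approach/punch
# pairs combine into 'assault' events, which combine into one 'fighting'.
gestures = [Event("approach", 0, 10), Event("punch", 12, 20),
            Event("approach", 25, 33), Event("punch", 35, 42)]
print(recognize("fighting", gestures))
```

On this toy stream the sketch reports a single 'fighting' event spanning frames 0–42, built from two recursively matched 'assault' sub-events; the paper's actual algorithm additionally handles probabilistic matching and imperfect low-level detections, which this sketch omits.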


Keywords: Human activity recognition · Event detection · Semantic-level video analysis · Hierarchical action recognition


Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

1. Computer and Vision Research Center, The University of Texas at Austin, Austin, USA
