Video Retrieval of Human Interactions Using Model-Based Motion Tracking and Multi-layer Finite State Automata

  • Sangho Park
  • Jihun Park
  • Jake K. Aggarwal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2728)


Recognition of human interactions in video is useful for video annotation, automated surveillance, and content-based video retrieval. This paper presents a model-based approach to motion tracking and recognition of human interactions using multi-layer finite state automata (FA). The system targets widely available monocular surveillance videos with a static background. A three-dimensional human body model is built from a sphere and cylinders and is projected onto a two-dimensional image plane to fit the foreground image silhouette. We convert the human motion tracking problem into a parameter optimization problem, which removes the need to compute inverse kinematics. A cost functional estimates the degree of overlap between the foreground silhouette of the input image and the silhouette of the projected three-dimensional body model. A behavior recognition system analyzes the motion data obtained from the tracker in terms of feet, torso, and hands. The recognition model represents human behavior as a sequence of states that register the configuration of individual body parts in space and time. To overcome the exponential growth in the number of states that usually occurs in a single-level FA, we propose a multi-layer FA that abstracts states and events from motion data at multiple levels: the low-level FA analyze body parts only, and the high-level FA analyzes the human interaction. Motion tracking results from video sequences are presented. Our recognition framework successfully recognizes various human interactions such as approaching, departing, pushing, pointing, and handshaking.
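The layering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state names, input symbols, the `eps` threshold, and the helper functions (`memoryless_fa`, `make_interaction_fa`, `recognize`) are all assumptions made for illustration. Low-level FA track the torso and each person's hand separately, and the high-level FA sees only their abstracted states, so its transition table stays small instead of growing with every combination of body-part configurations.

```python
class FiniteAutomaton:
    """Deterministic FA; an undefined (state, symbol) pair leaves the state unchanged."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = dict(transitions)  # {(state, symbol): next_state}

    def step(self, symbol):
        self.state = self.transitions.get((self.state, symbol), self.state)
        return self.state


def memoryless_fa(start, symbol_to_state):
    """Low-level FA whose next state depends only on the input symbol (assumed design)."""
    states = set(symbol_to_state.values()) | {start}
    table = {(s, sym): nxt for s in states for sym, nxt in symbol_to_state.items()}
    return FiniteAutomaton(start, table)


def make_torso_fa():
    # Hypothetical torso states driven by the change in inter-person distance.
    return memoryless_fa("still", {"closer": "approaching",
                                   "farther": "departing",
                                   "same": "still"})


def make_hand_fa():
    # Hypothetical hand states for one person.
    return memoryless_fa("rest", {"low": "rest",
                                  "extended": "reach",
                                  "touching": "contact"})


def make_interaction_fa():
    # High-level FA: consumes events abstracted from the low-level states.
    return FiniteAutomaton("idle", {
        ("idle", "approaching"): "approach",
        ("approach", "still"): "near",
        ("approach", "departing"): "depart",
        ("near", "both_contact"): "handshake",
        ("near", "departing"): "depart",
        ("handshake", "departing"): "depart",
    })


def recognize(frames, eps=0.05):
    """frames: list of (torso_distance, hand_symbol_a, hand_symbol_b) per video frame.

    Returns the sequence of distinct high-level states that were visited.
    """
    torso, hand_a, hand_b = make_torso_fa(), make_hand_fa(), make_hand_fa()
    top = make_interaction_fa()
    visited = [top.state]
    prev = frames[0][0]
    for dist, sym_a, sym_b in frames:
        delta, prev = dist - prev, dist
        torso_sym = "closer" if delta < -eps else "farther" if delta > eps else "same"
        t = torso.step(torso_sym)
        a = hand_a.step(sym_a)
        b = hand_b.step(sym_b)
        # Abstract the three low-level states into one high-level event.
        event = "both_contact" if a == b == "contact" else t
        state = top.step(event)
        if state != visited[-1]:
            visited.append(state)
    return visited


if __name__ == "__main__":
    demo = [(3.0, "low", "low"), (2.0, "low", "low"), (1.0, "low", "low"),
            (1.0, "extended", "extended"), (1.0, "touching", "touching"),
            (1.0, "low", "low"), (2.0, "low", "low")]
    print(recognize(demo))
```

In this sketch the high-level table has six transitions, whereas a single-level FA over the same inputs would need states for every combination of torso and hand configurations. Other interactions named in the abstract, such as pushing or pointing, would need their own events and transitions, which are not modeled here.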


Human Interaction; Body Model; Motion Tracking; Video Retrieval; Human Body Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Sangho Park (1)
  • Jihun Park (2)
  • Jake K. Aggarwal (1)
  1. Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin
  2. Department of Computer Engineering, Hongik University, Seoul, Korea
