Video Retrieval of Human Interactions Using Model-Based Motion Tracking and Multi-layer Finite State Automata
Recognition of human interactions in video is useful for video annotation, automated surveillance, and content-based video retrieval. This paper presents a model-based approach to motion tracking and recognition of human interactions using multi-layer finite state automata (FA). The system targets widely available monocular surveillance videos with a static background. A three-dimensional human body model is built from a sphere and cylinders and is projected onto a two-dimensional image plane to fit the foreground image silhouette. We convert the human motion tracking problem into a parameter optimization problem, which avoids the need to compute inverse kinematics. A cost functional estimates the degree of overlap between the foreground input image silhouette and the silhouette of the projected three-dimensional body model. Motion data obtained from the tracker is analyzed in terms of feet, torso, and hands by a behavior recognition system. The recognition model represents human behavior as a sequence of states that register the configuration of individual body parts in space and time. To overcome the exponential growth in the number of states that usually occurs in a single-level FA, we propose a multi-layer FA that abstracts states and events from motion data at multiple levels: the low-level FA analyzes individual body parts only, and the high-level FA analyzes the human interaction. Motion tracking results from video sequences are presented. Our recognition framework successfully recognizes various human interactions such as approaching, departing, pushing, pointing, and handshaking.
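The layering idea can be illustrated with a minimal sketch. It is not the paper's implementation: the state names, input symbols, transition tables, and the `abstract_event` rule below are all hypothetical, chosen only to show how low-level automata for body parts can feed abstracted events to a high-level automaton for the interaction.

```python
class FiniteAutomaton:
    """Deterministic FA with transitions[(state, symbol)] -> next state."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions

    def step(self, symbol):
        # Undefined (state, symbol) pairs leave the state unchanged.
        self.state = self.transitions.get((self.state, symbol), self.state)
        return self.state


# Low level: one small FA per body part (hypothetical states and symbols).
TORSO_STATES = ("still", "approaching", "departing")
TORSO_MOVES = (("closer", "approaching"), ("farther", "departing"), ("steady", "still"))
TORSO_T = {(s, sym): nxt for s in TORSO_STATES for sym, nxt in TORSO_MOVES}
HAND_T = {("rest", "extend"): "extended", ("extended", "retract"): "rest"}


class Person:
    """Bundles the low-level automata of one tracked person."""

    def __init__(self):
        self.torso = FiniteAutomaton("still", TORSO_T)
        self.hand = FiniteAutomaton("rest", HAND_T)

    def step(self, torso_symbol, hand_symbol):
        return self.torso.step(torso_symbol), self.hand.step(hand_symbol)


def abstract_event(a, b):
    """Collapse two persons' low-level states into one high-level symbol."""
    (torso_a, hand_a), (torso_b, hand_b) = a, b
    extended = (hand_a, hand_b).count("extended")
    if extended == 2:
        return "both_hands"
    if extended == 1:
        return "one_hand"
    if "approaching" in (torso_a, torso_b):
        return "approach"
    if "departing" in (torso_a, torso_b):
        return "depart"
    return "none"


# High level: FA over abstracted events (hypothetical interaction model).
INTERACTION_T = {
    ("idle", "approach"): "approaching",
    ("approaching", "both_hands"): "handshaking",
    ("approaching", "one_hand"): "pointing",
    ("handshaking", "depart"): "departing",
    ("pointing", "depart"): "departing",
    ("departing", "none"): "idle",
}


def recognize(frames):
    """Return the high-level state after each frame of per-person symbols."""
    p1, p2 = Person(), Person()
    high = FiniteAutomaton("idle", INTERACTION_T)
    labels = []
    for (t1, h1), (t2, h2) in frames:
        event = abstract_event(p1.step(t1, h1), p2.step(t2, h2))
        labels.append(high.step(event))
    return labels


# Made-up input: each frame holds (torso symbol, hand symbol) for two persons.
# "hold" has no hand transition, so the hand FA keeps its current state.
FRAMES = [
    (("closer", "hold"), ("closer", "hold")),
    (("steady", "extend"), ("steady", "extend")),
    (("steady", "retract"), ("steady", "retract")),
    (("farther", "hold"), ("farther", "hold")),
]
```

Calling `recognize(FRAMES)` walks the high-level automaton through approaching, handshaking, and departing. In this toy setup each person has 3 torso states and 2 hand states, so a single flat automaton over two persons would have to distinguish 36 joint body-part configurations before any interaction state is added, whereas the layered version keeps each table small.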
Keywords: Human Interaction, Body Model, Motion Tracking, Video Retrieval, Human Body Model