The Visual Computer, Volume 27, Issue 12, pp 1115–1123

Simultaneous tracking and action recognition for single actor human actions

  • Vivek Kumar Singh
  • Ram Nevatia
Original Article


This paper presents an approach to simultaneously tracking the pose and recognizing the actions of a human in a video. This is achieved by combining a Dynamic Bayesian Action Network (DBAN) with 2D body part models. The existing DBAN implementation relies on fairly weak observation features, which limits its recognition accuracy. In this work, we use a 2D body part model for accurate pose alignment, which in turn improves both the pose estimate and the action recognition accuracy. To compensate for the additional time required for alignment, we use an action entropy-based scheme to determine the minimum number of states to be maintained in each frame while avoiding sample impoverishment. In addition, we present an approach to automating the keypose selection task for learning 3D action models from a few annotations. We demonstrate our approach on a hand gesture dataset with 500 action sequences and show that our algorithm achieves a 6% improvement in accuracy over DBAN.
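The abstract does not give the formula behind the action entropy-based scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each maintained state is a weighted sample carrying an action label, computes the Shannon entropy of the resulting action distribution, and interpolates linearly between a lower and an upper bound on the number of states. The function names, the linear rule, and the parameters `n_min` and `n_max` are all hypothetical.

```python
import math
from collections import defaultdict


def action_entropy(samples):
    """Shannon entropy (in nats) of the action distribution implied by weighted samples.

    `samples` is a list of (action_label, weight) pairs; weights need not be normalized.
    """
    mass = defaultdict(float)
    for action, weight in samples:
        mass[action] += weight
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("sample weights must sum to a positive value")
    probs = [m / total for m in mass.values() if m > 0]
    return -sum(p * math.log(p) for p in probs)


def num_states_to_keep(samples, num_actions, n_min, n_max):
    """Hypothetical rule (an assumption, not from the paper): keep few states when
    the action is nearly certain and many when it is ambiguous."""
    if num_actions < 2:
        return n_min
    # Normalize by the maximum possible entropy so the ratio lies in [0, 1].
    h_norm = action_entropy(samples) / math.log(num_actions)
    h_norm = min(1.0, max(0.0, h_norm))
    return n_min + round(h_norm * (n_max - n_min))
```

Under this assumed rule, a frame in which all sample weight falls on one action keeps only `n_min` states, while a frame whose weight is spread evenly over all actions keeps `n_max`, which is one simple way to save computation without collapsing the sample set too early.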


Human action recognition · Dynamic Bayesian network · Pictorial structure





Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. University of Southern California, Los Angeles, USA
