International Journal of Computer Vision, Volume 100, Issue 1, pp 16–37

Coupled Action Recognition and Pose Estimation from Multiple Views

Abstract

Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present a framework for coupled action recognition and pose estimation that formulates pose estimation as an optimization over a set of action-specific manifolds. The framework allows a 2D appearance-based action recognition system to be integrated as a prior for 3D pose estimation, and the action labels to be refined using relational pose features computed from the extracted 3D poses. Our experiments show that our pose estimation system can estimate body poses with many degrees of freedom using very few particles and achieves state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate how pose estimation accuracy and action recognition accuracy affect each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance compared to action recognition using low-level appearance features.
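The abstract's claim of estimating high-dimensional poses "using very few particles" refers to particle-based stochastic optimization in the style of interacting simulated annealing. As a rough illustration only (not the paper's actual algorithm), the following sketch shows the generic resample-and-diffuse loop over a low-dimensional search space; the `objective` stands in for an image-likelihood term, and all names and parameter values here are hypothetical:

```python
import math
import random

def optimize_pose(objective, dim, n_particles=25, n_iters=40, sigma=0.5, seed=0):
    """Annealing-style particle optimization: maintain a small particle set,
    resample particles proportionally to an exponentiated fitness, and
    perturb them with noise that shrinks as iterations proceed."""
    rng = random.Random(seed)
    particles = [[rng.uniform(-2.0, 2.0) for _ in range(dim)]
                 for _ in range(n_particles)]
    for it in range(n_iters):
        beta = 1.0 + it  # annealing schedule: sharpen weighting over time
        weights = [math.exp(-beta * objective(p)) for p in particles]
        total = sum(weights)
        probs = [w / total for w in weights]
        # resample according to the (annealed) weights
        particles = [list(rng.choices(particles, probs)[0])
                     for _ in range(n_particles)]
        # diffuse with shrinking Gaussian noise
        noise = sigma / (1.0 + it)
        particles = [[x + rng.gauss(0.0, noise) for x in p] for p in particles]
    return min(particles, key=objective)

# toy quadratic objective standing in for an image-likelihood term
best = optimize_pose(lambda p: sum((x - 1.0) ** 2 for x in p), dim=3)
```

A low-dimensional action-specific manifold keeps `dim` small, which is what lets a handful of particles suffice in this style of search.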

Keywords

Human pose estimation · Human action recognition · Tracking · Stochastic optimization · Hough transform


Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
  2. Max Planck Institute for Intelligent Systems, Tübingen, Germany
  3. Department of Electrical Engineering / IBBT, K.U. Leuven, Heverlee, Belgium
