Full-Body Human Motion Capture from Monocular Depth Images

  • Thomas Helten
  • Andreas Baak
  • Meinard Müller
  • Christian Theobalt
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8200)


Abstract

Optical capture of human body motion has many practical applications, ranging from motion analysis in sports and medicine, to ergonomics research, to computer animation in game and movie production. Unfortunately, many existing approaches require expensive multi-camera systems and controlled recording studios, and expect the person to wear special marker suits. Even marker-less approaches typically demand dense camera arrays and indoor recording. These requirements, together with the high acquisition cost of the equipment, make such systems accessible to only a small number of people. This has changed in recent years: the availability of inexpensive depth sensors, such as time-of-flight cameras or the Microsoft Kinect, has spawned new research on tracking human motion from monocular depth images. These approaches have the potential to make motion capture accessible to much larger user groups. However, despite significant progress, there are still unsolved challenges that limit the applicability of depth-based monocular full-body motion capture. Algorithms are challenged by very noisy sensor data, (self-)occlusions, and other ambiguities implied by the limited information that a depth sensor can capture of the scene. In this article, we give an overview of the state of the art in full-body human motion capture using depth cameras. In particular, we elaborate on the challenges that current algorithms face and discuss possible solutions. Furthermore, we investigate how the integration of additional sensor modalities may help to resolve some of the ambiguities and improve tracking results.


Keywords: Point Cloud · Motion Capture · Depth Image · Iterative Closest Point · Inertial Sensor
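Several of the depth-based tracking approaches surveyed here refine a pose hypothesis by aligning a body model to the observed point cloud with variants of the Iterative Closest Point (ICP) algorithm, one of the keywords above. As a rough illustration of the core idea, not of any particular cited method, the following is a minimal rigid-alignment ICP sketch in Python with NumPy; the function names and the brute-force nearest-neighbour search are illustrative assumptions (real trackers use articulated models and spatial acceleration structures):

```python
import numpy as np

def best_rigid_transform(src, dst):
    # Kabsch solution: least-squares rotation R and translation t
    # mapping the N x 3 point set src onto the matched points dst.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

def icp(src, dst, iters=30):
    # Alternate between (1) matching each source point to its nearest
    # target point and (2) solving for the rigid transform that best
    # aligns the matches; accumulate the total transform.
    cur = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # brute-force nearest neighbours (fine for small toy clouds)
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matches = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total, cur
```

The noisy, partial depth data discussed in the article is precisely what breaks such a vanilla scheme: wrong nearest-neighbour matches under occlusion drive the optimization into the local minima that the surveyed methods try to avoid.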





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Thomas Helten (1)
  • Andreas Baak (1)
  • Meinard Müller (2)
  • Christian Theobalt (1)

  1. MPI Informatik, Saarbrücken, Germany
  2. International Audio Laboratories, Erlangen, Germany
