Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps

  • Yu Du
  • Yongkang Wong
  • Yonghao Liu
  • Feilin Han
  • Yilin Gui
  • Zhen Wang
  • Mohan Kankanhalli
  • Weidong Geng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9908)


The recovery of 3D human pose from a monocular camera is an inherently ill-posed problem, because a large number of 3D poses can project to the same 2D image. To improve the accuracy of 3D motion reconstruction, we introduce additional built-in knowledge, namely the height-map, into the algorithmic scheme for reconstructing 3D pose/motion under a single-view calibrated camera. Our proposed framework makes two major contributions. Firstly, the RGB image and its computed height-map are combined to detect the landmarks of 2D joints with a dual-stream deep convolutional network. Secondly, we formulate a new objective function to estimate 3D motion from the detected 2D joints in the monocular image sequence, which enforces temporal coherence constraints on both the camera and the 3D poses. Experiments on the HumanEva, Human3.6M, and MCAD datasets validate that our method outperforms state-of-the-art algorithms on both 2D joint localization and 3D motion recovery. Moreover, the evaluation results on HumanEva indicate that the performance of our proposed single-view approach is comparable to that of its multi-view deep learning counterpart.
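The second contribution above, estimating 3D motion from detected 2D joints while enforcing temporal coherence, can be illustrated by an objective of the following general shape. This is a minimal sketch, not the paper's actual formulation: it assumes a simple weak-perspective camera with a fixed focal length `f`, and the smoothness weight `lam` is a hypothetical parameter (the paper's objective additionally constrains the camera parameters across frames).

```python
import numpy as np

def project(X, f=1000.0):
    """Weak-perspective projection of 3D joints (J, 3) to 2D pixels (J, 2)."""
    return f * X[:, :2] / X[:, 2:3]

def motion_objective(poses, joints_2d, lam=1.0):
    """Reprojection error plus a temporal-coherence penalty over a sequence.

    poses:     (T, J, 3) candidate 3D joint positions per frame.
    joints_2d: (T, J, 2) detected 2D joint landmarks per frame.
    lam:       hypothetical weight of the smoothness term.
    """
    # Data term: each frame's 3D pose should reproject onto the detections.
    reproj = sum(np.sum((project(X) - x) ** 2)
                 for X, x in zip(poses, joints_2d))
    # Coherence term: consecutive 3D poses should change smoothly.
    smooth = sum(np.sum((poses[t] - poses[t - 1]) ** 2)
                 for t in range(1, len(poses)))
    return reproj + lam * smooth
```

In this toy form, a smooth 3D trajectory consistent with the 2D detections scores lower than a jittery one, which is the behavior the temporal constraints are meant to reward; the actual minimization over poses and camera would be carried out with a nonlinear least-squares solver.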


Keywords: Human pose estimation · Height-map



This work was supported by a grant from the National High Technology Research and Development Program of China (863 Program, 2013AA013705) and the National Natural Science Foundation of China (No. 61379067). This research was partly supported by the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centre in Singapore Funding Initiative.

Supplementary material

419976_1_En_2_MOESM1_ESM.mp4
Supplementary material 1 (MP4, 3.2 MB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Yu Du (1)
  • Yongkang Wong (2)
  • Yonghao Liu (1)
  • Feilin Han (1)
  • Yilin Gui (1)
  • Zhen Wang (1)
  • Mohan Kankanhalli (2, 3)
  • Weidong Geng (1)

  1. College of Computer Science, Zhejiang University, Hangzhou, China
  2. Interactive and Digital Media Institute, National University of Singapore, Singapore
  3. School of Computing, National University of Singapore, Singapore
