Learning 3D Human Pose from Structure and Motion

  • Rishabh Dabral
  • Anurag Mundhada
  • Uday Kusupati
  • Safeer Afaque
  • Abhishek Sharma
  • Arjun Jain
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)


3D human pose estimation from a single image is a challenging problem, especially in in-the-wild settings, owing to the lack of 3D-annotated data. We propose two anatomically inspired loss functions and use them within a weakly supervised learning framework to learn jointly from large-scale in-the-wild 2D data and indoor/synthetic 3D data. We also present a simple temporal network that exploits the temporal and structural cues in predicted pose sequences to harmonize the pose estimates over time. We analyze the proposed contributions through loss-surface visualizations and sensitivity analysis to give a deeper understanding of how they work. Together, the two networks capture the anatomical constraints of the human body in its static and kinetic states. Our complete pipeline improves on the state of the art by 11.8% and 12% on Human3.6M and MPI-INF-3DHP, respectively, and runs at 30 FPS on a commodity graphics card.



This work is supported by Mercedes-Benz Research & Development India (RD/0117-MBRDI00-001).


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Rishabh Dabral (1, corresponding author)
  • Anurag Mundhada (1)
  • Uday Kusupati (1)
  • Safeer Afaque (1)
  • Abhishek Sharma (2)
  • Arjun Jain (1)

  1. Indian Institute of Technology Bombay, Mumbai, India
  2. Gobasco AI Labs, Lucknow, India
