International Journal of Computer Vision, Volume 126, Issue 9, pp 993–1008

Image-Based Synthesis for Deep 3D Human Pose Estimation

  • Grégory Rogez
  • Cordelia Schmid


This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
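The per-joint selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-joint skeleton `NEIGHBORS`, the orthographic projection, the local distance measure, and all function names are hypothetical choices made only for illustration. The sketch shows only how, for each joint, one annotated image could be picked whose 2D pose locally matches the projected 3D pose; the stitching of image patches is not covered.

```python
import math

# Hypothetical toy skeleton (assumption): joint index -> kinematically adjacent joints.
NEIGHBORS = {
    0: [1],     # e.g. shoulder
    1: [0, 2],  # e.g. elbow
    2: [1],     # e.g. wrist
}


def project_orthographic(pose_3d):
    """Project a 3D pose to 2D by dropping depth (an assumed, simplified camera)."""
    return [(x, y) for (x, y, _z) in pose_3d]


def local_distance(pose_a, pose_b, joint):
    """Compare two 2D poses around `joint` after aligning both poses on that joint."""
    ax, ay = pose_a[joint]
    bx, by = pose_b[joint]
    total = 0.0
    for n in NEIGHBORS[joint]:
        # Offsets of the neighbouring joint relative to the aligned joint.
        da = (pose_a[n][0] - ax, pose_a[n][1] - ay)
        db = (pose_b[n][0] - bx, pose_b[n][1] - by)
        total += math.hypot(da[0] - db[0], da[1] - db[1])
    return total


def select_images_per_joint(pose_3d, annotated_2d_poses):
    """For each joint, return the index of the image with the best local 2D match."""
    target = project_orthographic(pose_3d)
    selection = {}
    for joint in NEIGHBORS:
        selection[joint] = min(
            range(len(annotated_2d_poses)),
            key=lambda i: local_distance(target, annotated_2d_poses[i], joint),
        )
    return selection


# Toy data (made up for this example): one candidate 3D pose, two annotated images.
candidate = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.2), (1.0, 1.0, 0.9)]
dataset = [
    [(5.0, 5.0), (6.0, 5.0), (7.0, 5.0)],  # image 0: straight arm
    [(2.0, 2.0), (3.0, 2.0), (3.0, 3.0)],  # image 1: bent arm, same shape as candidate
]
selection = select_images_per_joint(candidate, dataset)
print(selection)
```

In this toy case both images match equally well around joint 0, so the first one is kept, while joints 1 and 2 are matched to image 1. Aligning poses on the joint before comparing makes the match local and translation-invariant, which is one simple way to read "locally matches" in the abstract.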


Keywords: Human 3D pose estimation, Data augmentation, CNN, Data synthesis



This work was supported by the European Commission under FP7 Marie Curie IOF Grant (PIOF-GA-2012-328288) and partially supported by the ERC advanced Grant ALLEGRO and an Amazon Academic Research Award (AARA). We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. We thank Dr. Philippe Weinzaepfel for his help. We also thank the anonymous reviewers for their comments and suggestions that helped improve the paper.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LJK, Grenoble, France
