Image-Based Synthesis for Deep 3D Human Pose Estimation

Rogez, Grégory; Schmid, Cordelia

doi:10.1007/s11263-018-1071-9

Published: 19 March 2018

Volume 126, pages 993–1008 (2018)

International Journal of Computer Vision

Abstract

This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
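The per-joint selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-joint skeleton, the `NEIGHBOURS` table, the function name `select_images_per_joint`, and the summed Euclidean error are all assumptions made for illustration. The idea shown is only that each joint of the projected 3D pose is compared locally (the joint plus its kinematic neighbours) against the 2D annotations of real images, and the best-matching image is kept for that joint.

```python
import numpy as np

# Hypothetical toy skeleton: each joint is matched together with its
# kinematic neighbours. Indices are illustrative, not the paper's.
NEIGHBOURS = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}


def select_images_per_joint(projected_pose, database_poses):
    """Pick, for each joint, the database image with the best local 2D match.

    projected_pose: (J, 2) array, 2D projection of the candidate 3D pose.
    database_poses: (N, J, 2) array of 2D pose annotations of real images.
    Returns a dict mapping joint index to the selected image index.
    """
    selected = {}
    for joint, local in NEIGHBOURS.items():
        # Centre the local neighbourhood on the joint itself so that the
        # comparison ignores where the person stands in the image.
        query = projected_pose[local] - projected_pose[joint]
        candidates = (database_poses[:, local, :]
                      - database_poses[:, joint:joint + 1, :])
        # Assumed matching cost: sum of per-joint Euclidean distances.
        errors = np.linalg.norm(candidates - query, axis=2).sum(axis=1)
        selected[joint] = int(np.argmin(errors))
    return selected


# Usage with two annotated images and one candidate pose.
projected = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
database = np.array([
    [[10.0, 10.0], [11.0, 10.0], [12.0, 15.0]],
    [[5.0, 5.0], [5.0, 6.0], [6.0, 6.0]],
])
selection = select_images_per_joint(projected, database)
```

In this toy case different joints are served by different images, which is what makes the subsequent stitching of local patches necessary.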

Notes

http://mocap.cs.cmu.edu.
Ground-truth classes are obtained by assigning the ground-truth 2D pose to the closest cluster.
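As a rough illustration of the closest-cluster assignment mentioned in the note above, the sketch below labels each pose with the index of its nearest cluster centre. The function name `assign_pose_classes`, the flattening of joint coordinates, and the squared Euclidean distance are assumptions; the note does not specify the metric used.

```python
import numpy as np


def assign_pose_classes(poses, centroids):
    """Assign each pose to its closest cluster centre.

    poses: (N, J, D) array of poses (D = 2 for 2D poses).
    centroids: (K, J, D) array of cluster centres, one per pose class.
    Returns an (N,) integer array of class labels in [0, K).
    """
    flat_poses = poses.reshape(len(poses), -1)
    flat_centroids = centroids.reshape(len(centroids), -1)
    # Pairwise squared Euclidean distances, shape (N, K).
    # Assumed metric: the source only says "closest cluster".
    diffs = flat_poses[:, None, :] - flat_centroids[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    return distances.argmin(axis=1)


# Usage with two pose classes and two annotated poses.
centres = np.array([
    [[0.0, 0.0], [1.0, 1.0]],
    [[10.0, 10.0], [11.0, 11.0]],
])
annotated = np.array([
    [[0.1, 0.0], [1.0, 1.2]],
    [[9.0, 10.0], [11.0, 12.0]],
])
labels = assign_pose_classes(annotated, centres)
```

The resulting labels are the targets a K-way classifier would be trained on, with K equal to the number of cluster centres.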

Acknowledgements

This work was supported by the European Commission under FP7 Marie Curie IOF Grant (PIOF-GA-2012-328288) and partially supported by the ERC advanced Grant ALLEGRO and an Amazon Academic Research Award (AARA). We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. We thank Dr. Philippe Weinzaepfel for his help. We also thank the anonymous reviewers for their comments and suggestions that helped improve the paper.

Author information

Authors and Affiliations

Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LJK, 38000 Grenoble, France
Grégory Rogez & Cordelia Schmid

Corresponding author

Correspondence to Grégory Rogez.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.

About this article

Cite this article

Rogez, G., Schmid, C. Image-Based Synthesis for Deep 3D Human Pose Estimation. Int J Comput Vis 126, 993–1008 (2018). https://doi.org/10.1007/s11263-018-1071-9

Received: 25 July 2017
Accepted: 26 February 2018
Published: 19 March 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11263-018-1071-9
