Abstract
This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
Similar content being viewed by others
Notes
ground truth classes being obtained by assigning the ground truth 2D pose to the closest cluster.
References
Agarwal, A., & Triggs, B. (2006). Recovering 3D human pose from monocular images. PAMI, 28(1), 44–58.
Akhter, I., & Black, M. (2015). Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR
Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2D human pose estimation: New benchmark and state-of- the-art analysis. In CVPR
Bissacco, A., Yang, M.-H., & Soatto, S. (2006). Detecting humans via their pose. In NIPS
Bo, L., & Sminchisescu, C. (2010). Twin Gaussian processes for structured prediction. IJCV, 87(1–2), 28–52.
Bourdev, L., & Malik, J. (2009). Poselets: Body part detectors trained using 3D human pose annotations. In ICCV
Chen, C.-H. & Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In CVPR
Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., Lischinski, D., Cohen-Or, D., & Chen, B. (2016). Synthesizing training images for boosting human 3D pose estimation. In 3DV
Chen, X., & Yuille, A.L. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS
de Souza, C. R., Gaidon, A., Cabon, Y., & Lopez, A.M. (2017). Procedural generation of videos to train deep action recognition networks. In CVPR
Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In ICCV
Du, Y., Wong, Y., Liu, Y., Han, F., Gui, Y., Wang, Z., Kankanhalli, M., & Geng, W. (2016). Marker-less 3D human motion capture with monocular image sequence and height-maps. In ECCV
Elhayek, A., Aguiar, E., Jain, A., Tompson, J., Pishchulin, L., Andriluka, M., Bregler, C., Schiele, B., & Theobalt, C. (2015). Efficient convnet-based marker-less motion capture in general scenes with a low number of cameras. In CVPR
Enzweiler, M., & Gavrila, D.M. (2008). A mixed generative-discriminative framework for pedestrian classification. In CVPR
Fan, X., Zheng, K., Zhou, Y., & Wang, S. (2014). Pose locality constrained representation for 3D human pose reconstruction. In ECCV
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative adversarial nets. In NIPS
Hattori, H., Boddeti, V.N., Kitani, K.M., & Kanade, T. (2015). Learning scene-specific pedestrian detectors without real data. In CVPR
Hornung, A., Dekkers, E., & Kobbelt, L. (2007). Character animation from 2D pictures and 3D motion data. ACM Transactons On Graphics, 26(1), 1.
Huang, S., & Ramanan, D. (2017). Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters. In CVPR.
Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014). Human(3).6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 36(7), 1325–1339.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. IJCV, 116(1), 1–20.
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In NIPS
Johnson, S., & Everingham, M. (2010). Clustered pose and nonlinear appearance models for human pose estimation. In BMVC
Johnson, S., & Everingham, M. (2011). Learning effective human pose estimation from inaccurate annotation. In CVPR
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., & Sheikh, Y. (2015). Panoptic studio: A massively multiview system for social motion capture. In ICCV
Kostrikov, I., & Gall, J. (2014). Depth sweep regression forests for estimating 3D human pose from images. In BMVC
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS
Li, S., Zhang, W., & Chan, A.B. (2015). Maximum-margin structured learning with deep networks for 3D human pose estimation. In ICCV
Li, S., Zhang, W., & Chan, A.B. (2016). Maximum-margin structured learning with deep networks for 3D human pose estimation. In IJCV
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 34(6), 248:1–248:16.
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3D Vision (3DV)
Moreno-Noguer, F. (2017). 3D human pose estimation from a single image via distance matrix regression. In CVPR
Mori, G., & Malik, J. (2006). Recovering 3D human body configurations using shape contexts. PAMI, 28(7), 1052–1062.
Okada, R., & Soatto, S. (2008). Relevant feature selection for human pose estimation and localization in cluttered images. In ECCV
Park, D., & Ramanan, D. (2015). Articulated pose estimation with tiny synthetic videos. In CVPR ChaLearn Looking at People Workshop
Pavlakos, G., Zhou, X., Derpanis, K.G., & Daniilidis, K. (2017). Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR
Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object detectors from 3D models. In ICCV
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., & Schiele, B. (2016). DeepCut: Joint subset partition and labeling for multi person pose estimation. CVPR
Pishchulin, L., Jain, A., Andriluka, M., T. Thormählen, & Schiele, B. (2012). Articulated people detection and pose estimation: Reshaping the future. In CVPR
Ramakrishna, V., Kanade, T., & Sheikh, Y. (2012). Reconstructing 3D human pose from 2D image landmarks. In ECCV
Rogez, G., Rihan, J., Orrite, C., & Torr, P. (2012). Fast human pose detection using randomized hierarchical cascades of rejectors. IJCV, 99(1), 25–52.
Rogez, G., & Schmid, C. (2016). MoCap-guided data augmentation for 3D pose estimation in the wild. In NIPS
Rogez, G., Supancic, J., & Ramanan, D. (2015). First-person pose recognition using egocentric workspaces. In CVPR
Rogez, G., Weinzaepfel, P., & Schmid, C. (2017). LCR-Net: Localization-Classification-Regression for human pose. In CVPR
Romero, J., Kjellstrom, H., & Kragic, D. (2010). Hands in action: Real-time 3D reconstruction of hands in interaction with objects. In ICRA
Sanzari, M., Ntouskos, V., & Pirri, F. (2016). Bayesian image based 3D pose estimation. In ECCV
Shakhnarovich, G., Viola, P.A., & Darrell, T. (2003). Fast pose estimation with parameter-sensitive hashing. In ICCV
Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR
Sigal, L., Balan, A. O., & Black, M. J. (2010). Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1–2), 4–27.
Sigal, L., & Black, M.J. (2006). Predicting 3D people from 2D pictures. In AMDO
Simo-Serra, E., Quattoni, A., Torras, C., & Moreno-Noguer, F. (2013). A joint model for 2D and 3D pose estimation from a single image. In CVPR
Simo-Serra, E., Ramisa, A., G. Alenyà, Torras, C., & Moreno-Noguer, F. (2012). Single image 3D human pose estimation from noisy observations. In CVPR
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556
Su, H., Ruizhongtai, C., Qi, Y.Li, & Guibas, L.J. (2015). Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV
Taylor, J. C. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In CVPR
Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., & Fua, P. (2016). Structured prediction of 3D human pose with deep neural networks. In BMVC
Tekin, B., Rozantsev, A., Lepetit, V., & Fua, P. (2016). Direct prediction of 3D body poses from motion compensated sequences. In CVPR
Tome, D., Russell, C., & Agapito, L. (2017). Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR
Tompson, J.J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS
Toshev, A., & Szegedy C. (2014) DeepPose: Human pose estimation via deep neural networks. In CVPR
Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M.J., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. In CVPR
Wang, C., Wang, Y., Lin, Z., Yuille, A. L., & Gao, W. (2014). Robust estimation of 3D human poses from a single image. In CVPR
Wei, S.-E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016) Convolutional pose machines. In CVPR
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J (2015) 3D shapenets: A deep representation for volumetric shapes. In CVPR
Xu, J., Ramos, S., Vázquez, D., & López, A. M. (2014). Domain adaptation of deformable part-based models. PAMI, 36(12), 2367–2380.
Yang, W., Ouyang, W., Li, H., & Wang, X. (2016) End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR
Yasin, H., Iqbal, U., Krüger, B., Weber, A., & Gall, J. (2016) A dual-source approach for 3D pose estimation from a single image. In CVPR
Zhou, F., & De la Torre, F (2014) Spatio-temporal matching for human detection in video. In ECCV
Zhou, X., Huang, Q., Sun, X., Xue, X., & Wei, Y. (2017) Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV
Zhou, X., Sun, X., Zhang, W., Liang, S., & Wei, Y (2016) Deep kinematic pose regression. In ECCV Workshop on Geometry Meets Deep Learning
Zhou, X., Zhu, M., Leonardos, S., Derpanis, K., & Daniilidis, K. (2016) Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR
Zuffi, S., & Black, M.J. (2015) The stitched puppet: A graphical model of 3D human shape and pose. In CVPR
Acknowledgements
This work was supported by the European Commission under FP7 Marie Curie IOF Grant (PIOF-GA-2012-328288) and partially supported by the ERC advanced Grant ALLEGRO and an Amazon Academic Research Award (AARA). We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. We thank Dr. Philippe Weinzaepfel for his help. We also thank the anonymous reviewers for their comments and suggestions that helped improve the paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Rights and permissions
About this article
Cite this article
Rogez, G., Schmid, C. Image-Based Synthesis for Deep 3D Human Pose Estimation. Int J Comput Vis 126, 993–1008 (2018). https://doi.org/10.1007/s11263-018-1071-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-018-1071-9