DOPE: Distillation of Part Experts for Whole-Body 3D Pose Estimation in the Wild

Weinzaepfel, Philippe; Brégier, Romain; Combaluzier, Hadrien; Leroy, Vincent; Rogez, Grégory

doi:10.1007/978-3-030-58574-7_23

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12371)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We introduce DOPE, the first method to detect and estimate whole-body 3D human poses, including bodies, hands and faces, in the wild. Achieving this level of detail is key for a number of applications that require understanding how people interact with each other or with the environment. The main challenge is the lack of in-the-wild data with labeled whole-body 3D poses. In previous work, training data has been annotated or generated for simpler tasks focusing on bodies, hands or faces separately. In this work, we propose to take advantage of these datasets to train independent experts for each part, namely a body, a hand and a face expert, and distill their knowledge into a single deep network designed for whole-body 2D-3D pose detection. In practice, given a training image with partial or no annotation, each part expert detects its subset of keypoints in 2D and 3D, and the resulting estimates are combined to obtain whole-body pseudo ground-truth poses. A distillation loss encourages the whole-body predictions to mimic the experts' outputs. Our results show that this approach significantly outperforms the same whole-body model trained without distillation while staying close to the performance of the experts. Importantly, DOPE is computationally less demanding than the ensemble of experts and can achieve real-time performance. Test code and models are available at https://europe.naverlabs.com/research/computer-vision/dope.
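
The pseudo ground-truth and distillation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keypoint counts in `PART_SIZES`, the function names, the single-hand simplification, and the choice of a masked L1 distance as the distillation loss are all assumptions made for illustration, since the abstract specifies none of them.

```python
import numpy as np

# Hypothetical keypoint counts per part; the abstract does not specify them.
PART_SIZES = {"body": 13, "hand": 21, "face": 84}


def combine_expert_outputs(expert_poses):
    """Stack per-part expert estimates into one whole-body pseudo ground truth.

    expert_poses maps a part name to an array of shape (K_part, 3), or to
    None when that expert produced no detection for the image.
    Returns (pose, mask): pose has shape (K_total, 3); mask is a boolean
    vector marking the keypoints that actually come from an expert.
    """
    chunks, masks = [], []
    for part, k in PART_SIZES.items():
        est = expert_poses.get(part)
        if est is None:
            # No detection: fill with zeros and mark the keypoints as invalid.
            chunks.append(np.zeros((k, 3)))
            masks.append(np.zeros(k, dtype=bool))
        else:
            est = np.asarray(est, dtype=float)
            if est.shape != (k, 3):
                raise ValueError(f"{part}: expected {(k, 3)}, got {est.shape}")
            chunks.append(est)
            masks.append(np.ones(k, dtype=bool))
    return np.concatenate(chunks), np.concatenate(masks)


def distillation_loss(student_pose, pseudo_gt, mask):
    """Mean L1 distance over keypoints covered by the pseudo ground truth.

    The L1 distance is an assumption of this sketch, not the paper's loss.
    """
    if not mask.any():
        return 0.0
    return float(np.abs(student_pose[mask] - pseudo_gt[mask]).mean())


if __name__ == "__main__":
    # Toy example: the hand expert found nothing in this image.
    experts = {
        "body": np.ones((13, 3)),
        "hand": None,
        "face": np.zeros((84, 3)),
    }
    pseudo_gt, valid = combine_expert_outputs(experts)
    student = np.zeros_like(pseudo_gt)
    print(pseudo_gt.shape, int(valid.sum()), distillation_loss(student, pseudo_gt, valid))
```

The mask keeps keypoints without an expert detection out of the loss, which is one simple way to handle training images where only some parts are visible or detected.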

Author information

Authors and Affiliations

NAVER LABS Europe, Meylan, France
Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy & Grégory Rogez

Corresponding author

Correspondence to Philippe Weinzaepfel.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Cite this paper

Weinzaepfel, P., Brégier, R., Combaluzier, H., Leroy, V., Rogez, G. (2020). DOPE: Distillation of Part Experts for Whole-Body 3D Pose Estimation in the Wild. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12371. Springer, Cham. https://doi.org/10.1007/978-3-030-58574-7_23

DOI: https://doi.org/10.1007/978-3-030-58574-7_23
Published: 13 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58573-0
Online ISBN: 978-3-030-58574-7
eBook Packages: Computer Science, Computer Science (R0)
