3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning

Xu, Xiangyu; Chen, Hao; Moreno-Noguer, Francesc; Jeni, László A.; De la Torre, Fernando

doi:10.1007/978-3-030-58545-7_17


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12354)

Included in the following conference series:

European Conference on Computer Vision


Abstract

3D human shape and pose estimation from monocular images has been an active area of research in computer vision, with a substantial impact on applications ranging from activity recognition to the creation of virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios vary widely in size, and a model trained at one resolution typically does not degrade gracefully across resolutions. Two common approaches to low-resolution input are to apply super-resolution techniques to the input images, which may introduce visual artifacts, or to train one model per resolution, which is impractical in many realistic applications.

To address these issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed network learns 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both of these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that RSC-Net achieves consistently better results than state-of-the-art methods on challenging low-resolution images.
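The abstract names two scale-consistency terms without giving their formulas. The following is a minimal sketch of what such terms could look like, not the paper's formulation: it assumes a mean-squared consistency penalty between predictions made at different resolutions, and an InfoNCE-style contrastive loss in which features of the same image at two resolutions form the positive pair. The function names, the `temperature` value, and the use of plain Python lists as stand-ins for network outputs are all hypothetical.

```python
import math


def consistency_loss(preds):
    """Output-level scale consistency (assumed form: mean squared error).

    preds: list of prediction vectors for the SAME image, ordered from the
    highest resolution (index 0, used as reference) to the lowest.
    """
    ref = preds[0]
    total = 0.0
    for p in preds[1:]:
        total += sum((a - b) ** 2 for a, b in zip(p, ref)) / len(ref)
    return total / (len(preds) - 1)


def _normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid division by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Feature-level scale consistency (assumed form: InfoNCE).

    anchor, positive: features of the same image at two resolutions.
    negatives: features of other images.
    """
    a = _normalize(anchor)
    pos = _dot(a, _normalize(positive)) / temperature
    logits = [pos] + [_dot(a, _normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate similarities.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - pos


if __name__ == "__main__":
    # Toy vectors standing in for network outputs at two resolutions.
    print(consistency_loss([[0.0, 0.0], [1.0, 1.0]]))
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this sketch the first term is zero only when every lower-resolution prediction matches the high-resolution one, and the second term is small when the two resolutions of the same image have more similar features than any other image does. How the actual method weights or defines these terms is not stated in the abstract.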



Author information

Authors and Affiliations

Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
Xiangyu Xu, László A. Jeni & Fernando De la Torre
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
Hao Chen
Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Barcelona, Spain
Francesc Moreno-Noguer
Facebook Reality Labs (Oculus), Pittsburgh, USA
Fernando De la Torre


Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm



About this paper

Cite this paper

Xu, X., Chen, H., Moreno-Noguer, F., Jeni, L.A., De la Torre, F. (2020). 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12354. Springer, Cham. https://doi.org/10.1007/978-3-030-58545-7_17


DOI: https://doi.org/10.1007/978-3-030-58545-7_17
Published: 05 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58544-0
Online ISBN: 978-3-030-58545-7
eBook Packages: Computer Science, Computer Science (R0)
