3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)


3D human shape and pose estimation from monocular images has been an active research area in computer vision, with substantial impact on new applications ranging from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios vary widely in size, and a model trained at one resolution typically does not degrade gracefully across resolutions. Two common workarounds are applying super-resolution techniques to the input images, which can introduce visual artifacts, and training a separate model for each resolution, which is impractical in many realistic applications.

To address these issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed network learns 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, while the contrastive learning scheme enforces scale-consistency of the deep features. We show that both of these training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that RSC-Net achieves consistently better results than state-of-the-art methods on challenging low-resolution images.


Keywords: 3D human shape and pose · Low resolution · Neural network · Self-supervised learning · Contrastive learning
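The two scale-consistency ideas in the abstract can be illustrated with a minimal sketch. This is not the paper's actual implementation: the loss functions, feature dimensions, and the toy tensors below are illustrative assumptions. The self-supervision loss matches the 3D parameters predicted from a high-resolution image and its downsampled copy, while the contrastive loss (here an InfoNCE-style objective, as in SimCLR) pulls together the deep features of the same image at two resolutions and pushes apart features of different images in the batch.

```python
# Hypothetical sketch of the two scale-consistency losses; all names
# (feat_hi, params_hi, etc.) are placeholders, not the authors' code.
import torch
import torch.nn.functional as F

def self_supervision_loss(params_hi, params_lo):
    """Output consistency: the 3D body parameters predicted from a
    low-resolution image should match those from its high-resolution
    counterpart (treated here as a fixed target via detach)."""
    return F.mse_loss(params_lo, params_hi.detach())

def contrastive_loss(feat_hi, feat_lo, temperature=0.1):
    """Feature consistency: InfoNCE-style loss where features of the
    same person at two resolutions are positives and the other images
    in the batch serve as negatives."""
    z1 = F.normalize(feat_hi, dim=1)
    z2 = F.normalize(feat_lo, dim=1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))   # positive pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-ins for encoder features and SMPL-style
# parameter vectors (batch of 8).
torch.manual_seed(0)
feat_hi, feat_lo = torch.randn(8, 128), torch.randn(8, 128)
params_hi, params_lo = torch.randn(8, 85), torch.randn(8, 85)
total = self_supervision_loss(params_hi, params_lo) + contrastive_loss(feat_hi, feat_lo)
print(total.item())
```

In training, `feat_hi`/`feat_lo` would come from the resolution-aware encoder applied to the same image at two scales, so both losses are minimized exactly when the network's predictions are invariant to input resolution.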

Supplementary material

Supplementary material 1 (PDF, 1.3 MB): 504446_1_En_17_MOESM1_ESM.pdf



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
  2. Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
  3. Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Barcelona, Spain
  4. Facebook Reality Labs (Oculus), Pittsburgh, USA
