NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image

  • Conference paper
  • In: Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

We propose NormalGAN, a fast adversarial learning-based method to reconstruct a complete and detailed 3D human from a single RGB-D image. Given a single front-view RGB-D image, NormalGAN performs two steps: front-view RGB-D rectification and back-view RGB-D inference. The final model is then generated by simply combining the front-view and back-view RGB-D information. However, inferring a back-view RGB-D image with high-quality geometric details and plausible texture is not trivial. Our key observation is that normal maps generally encode much more information about 3D surface details than RGB and depth images, so learning geometric details from normal maps is superior to learning them from other representations. NormalGAN therefore introduces an adversarial learning framework conditioned on normal maps, which not only improves front-view depth denoising but also infers the back-view depth image with surprisingly fine geometric details. Moreover, for texture recovery, we remove shading information from the front-view RGB image based on the refined normal map, which further improves the quality of the back-view color inference. Results on both a test dataset and real captured data demonstrate the superior performance of our approach. Given a consumer RGB-D sensor, NormalGAN generates complete and detailed 3D human reconstructions at 20 fps, which enables convenient interactive experiences in telepresence, AR/VR, and gaming scenarios.
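The key observation above — that normal maps encode surface detail more directly than raw depth — can be illustrated with a minimal sketch of how a normal map is derived from a depth image under a pinhole camera model. This is not the paper's implementation; the function name and the intrinsics `fx`, `fy` are illustrative assumptions:

```python
import numpy as np

def depth_to_normals(depth, fx, fy):
    """Approximate per-pixel surface normals from a depth map by
    back-projecting to camera space and taking central differences."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    cx, cy = w / 2.0, h / 2.0  # assume principal point at image center

    # Back-project each pixel to a 3D point (pinhole model).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)

    # Tangent vectors from neighboring points; normal = cross product.
    du = np.zeros_like(pts)
    dv = np.zeros_like(pts)
    du[:, 1:-1] = pts[:, 2:] - pts[:, :-2]
    dv[1:-1, :] = pts[2:, :] - pts[:-2, :]
    n = np.cross(du, dv)

    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-8)
```

Because each normal is a ratio of local depth gradients, small bumps and wrinkles that are nearly invisible in the raw depth values become strong signals in the normal map, which is why conditioning the GAN on normals helps recover fine geometry.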


Notes

  1. https://web.twindom.com/.



Acknowledgement

This work was supported by the National Key Research and Development Program of China (2018YFB2100500) and NSFC Grants No. 61827805 and No. 61861166002.

Author information

Correspondence to Yebin Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 98705 KB)

Supplementary material 2 (pdf 3350 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, L., Zhao, X., Yu, T., Wang, S., Liu, Y. (2020). NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_26


  • DOI: https://doi.org/10.1007/978-3-030-58565-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58564-8

  • Online ISBN: 978-3-030-58565-5

  • eBook Packages: Computer Science, Computer Science (R0)
