NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)

Abstract

We propose NormalGAN, a fast adversarial learning-based method to reconstruct a complete and detailed 3D human model from a single RGB-D image. Given a single front-view RGB-D image, NormalGAN performs two steps: front-view RGB-D rectification and back-view RGB-D inference. The final model is then generated by simply combining the front-view and back-view RGB-D information. However, inferring a back-view RGB-D image with high-quality geometric details and plausible texture is not trivial. Our key observation is that normal maps generally encode much more information about 3D surface details than RGB and depth images, so learning geometric details from normal maps is superior to learning from other representations. In NormalGAN, we introduce an adversarial learning framework conditioned on normal maps, which not only improves front-view depth denoising but also infers the back-view depth image with surprisingly fine geometric details. Moreover, for texture recovery, we remove shading information from the front-view RGB image based on the refined normal map, which further improves the quality of the back-view color inference. Results and experiments on both a test dataset and real captured data demonstrate the superior performance of our approach. Given a consumer RGB-D sensor, NormalGAN can generate complete and detailed 3D human reconstructions at 20 fps, which further enables convenient interactive experiences in telepresence, AR/VR, and gaming scenarios.
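The pipeline above hinges on converting depth images into normal maps before they condition the GAN. The abstract does not give the conversion itself, but a standard way to do it is to back-project each pixel with the camera intrinsics and take the cross product of finite-difference tangent vectors. The sketch below illustrates that idea only; the function name, the intrinsics `fx`/`fy`, and the centered principal point are assumptions, not the paper's implementation.

```python
import numpy as np

def depth_to_normal_map(depth, fx, fy):
    """Estimate a per-pixel normal map from a depth image (camera space).

    depth: (H, W) array of depth values in meters.
    fx, fy: focal lengths in pixels (hypothetical camera intrinsics).
    Returns an (H, W, 3) array of unit normals (zeros at the border).
    """
    H, W = depth.shape
    # Back-project pixels to 3D points, assuming the principal point
    # lies at the image center.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - W / 2.0) * depth / fx
    y = (v - H / 2.0) * depth / fy
    points = np.stack([x, y, depth], axis=-1)

    # Central-difference tangent vectors along image rows and columns.
    du = np.zeros_like(points)
    dv = np.zeros_like(points)
    du[:, 1:-1] = points[:, 2:] - points[:, :-2]
    dv[1:-1, :] = points[2:, :] - points[:-2, :]

    # The surface normal is the (normalized) cross product of the tangents.
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return np.where(norm > 1e-8, n / np.maximum(norm, 1e-8), 0.0)
```

For a fronto-parallel plane this yields normals pointing along the camera axis, which is the sanity check one would expect; in practice the paper's learned rectification would operate on noisier consumer-sensor depth than this idealized geometry.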

Keywords

3D human reconstruction · Single-view 3D reconstruction · Single-image 3D reconstruction · Generative adversarial networks

Notes

Acknowledgement

This paper is supported by the National Key Research and Development Program of China [2018YFB2100500] and the NSFC No.61827805 and No.61861166002.

Supplementary material

Supplementary material 1 (mp4 98705 KB)

Supplementary material 2 (pdf 3350 KB)


Copyright information

© Springer Nature Switzerland AG 2020
