TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12355)


We present TexMesh, a novel approach to reconstructing detailed human meshes with high-resolution full-body texture from RGB-D video. TexMesh enables high-quality free-viewpoint rendering of humans. Given the RGB frames, the captured environment map, and the coarse per-frame human mesh from RGB-D tracking, our method reconstructs spatiotemporally consistent and detailed per-frame meshes along with a high-resolution albedo texture. Using the incident illumination, we can accurately estimate local surface geometry and albedo. This in turn lets us use photometric constraints to adapt a synthetically trained model to real-world sequences in a self-supervised manner, yielding detailed surface geometry and high-resolution texture. In practice, we train our models on a short example sequence for self-adaptation, after which the model runs at interactive frame rates. We validate TexMesh on synthetic and real-world data and show that it outperforms the state of the art both quantitatively and qualitatively.
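To make the idea of a photometric constraint under known incident illumination concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the shading model, so the sketch assumes Lambertian reflectance and single-channel lighting represented by nine second-order spherical-harmonics coefficients, evaluated with the irradiance formula of Ramamoorthi and Hanrahan (2001). The function names `sh_irradiance` and `photometric_loss`, the coefficient ordering, and the L1 penalty are all illustrative choices.

```python
import numpy as np

# Constants of the irradiance environment map formula
# (Ramamoorthi and Hanrahan, 2001).
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708


def sh_irradiance(normals, sh):
    """Irradiance at unit normals under 9 spherical-harmonics coefficients.

    normals: (N, 3) array of unit surface normals.
    sh: (9,) array, assumed ordered L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22.
    Returns an (N,) array of irradiance values.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    L00, L1m1, L10, L11, L2m2, L2m1, L20, L21, L22 = sh
    return (
        C1 * L22 * (x * x - y * y)
        + C3 * L20 * z * z
        + C4 * L00
        - C5 * L20
        + 2.0 * C1 * (L2m2 * x * y + L21 * x * z + L2m1 * y * z)
        + 2.0 * C2 * (L11 * x + L1m1 * y + L10 * z)
    )


def photometric_loss(albedo, normals, sh, observed):
    """Mean absolute difference between Lambertian shading and observations.

    albedo: (N, 3) RGB albedo per surface point.
    observed: (N, 3) RGB colors sampled from the input frame (hypothetical input).
    """
    shading = sh_irradiance(normals, sh) / np.pi  # Lambertian: radiance = albedo * E / pi
    rendered = albedo * shading[:, None]
    return float(np.mean(np.abs(rendered - observed)))


# Usage: under purely ambient lighting, every normal receives the same irradiance.
normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
ambient = np.zeros(9)
ambient[0] = 1.0
albedo = np.full((2, 3), 0.5)
observed = albedo * (C4 / np.pi)
loss = photometric_loss(albedo, normals, ambient, observed)
```

In a self-supervised setting, a loss of this kind would be minimized with respect to the predicted normals and albedo while the lighting stays fixed; the sketch only evaluates the loss and leaves the optimization out.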


Keywords: Human shape reconstruction, Human texture generation



This work was done during Tiancheng Zhi’s internship at Facebook Reality Labs, Sausalito, CA, USA. We thank Junbang Liang, Yinghao Huang, and Nikolaos Sarafianos for their help with data generation.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
  2. Facebook Reality Labs, Sausalito, USA
