Abstract
We present a novel method to improve the accuracy of 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness of reconstruction for clothed people is limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly, a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models, rendered with diverse natural backgrounds, is demonstrated to allow transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms previous state-of-the-art learning-based single-image 3D human shape estimation approaches, achieving significant improvements in reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/.
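The core idea of a multiple-view loss for training monocular volumetric shape estimation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes orthographic max-projection as a stand-in for rendering the predicted occupancy volume from each viewpoint, and binary cross-entropy against ground-truth silhouettes as the per-view consistency term; the function names and the choice of axis-aligned views are hypothetical.

```python
import numpy as np

def project_silhouette(volume, axis):
    """Orthographic max-projection of an occupancy volume along one axis:
    a simple stand-in for rendering the predicted shape from one viewpoint."""
    return volume.max(axis=axis)

def multi_view_loss(pred_volume, gt_silhouettes, axes=(0, 1, 2), eps=1e-7):
    """Hypothetical multi-view consistency loss: average binary cross-entropy
    between projections of the predicted occupancy (values in [0, 1]) and
    the ground-truth silhouettes seen from the corresponding viewpoints."""
    total = 0.0
    for ax, gt in zip(axes, gt_silhouettes):
        p = np.clip(project_silhouette(pred_volume, ax), eps, 1 - eps)
        total += -np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p))
    return total / len(axes)

# Toy example: a prediction that matches all views gives a near-zero loss.
vol = np.zeros((4, 4, 4))
vol[1:3, 1:3, 1:3] = 1.0
gts = [vol.max(axis=a) for a in (0, 1, 2)]
print(multi_view_loss(vol, gts) < 1e-5)  # → True
```

Supervising the single-view prediction from several viewpoints in this way penalises shapes that only look correct from the input camera, which is the intuition behind the improved generalisation reported in the paper.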
Electronic supplementary material
Supplementary material 2 (mp4, 14352 KB)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Caliskan, A., Mustafa, A., Imre, E., Hilton, A. (2021). Multi-view Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science(), vol 12622. Springer, Cham. https://doi.org/10.1007/978-3-030-69525-5_5
DOI: https://doi.org/10.1007/978-3-030-69525-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69524-8
Online ISBN: 978-3-030-69525-5
eBook Packages: Computer Science (R0)