Abstract
Image-based 3D virtual try-on from a single photograph can offer Internet users an engaging shopping experience and has enormous commercial potential. Existing methods reconstruct a clothed 3D human body from the try-on image by extracting depth information from the input. However, their results are unstable: downsampling during depth prediction loses high-frequency detail relative to the larger spatial context, and the generator's gradients vanish when predicting occluded regions in high-resolution images. To address these problems, we propose a multi-resolution parallel approach that captures low-frequency information while retaining as much high-frequency depth detail as possible during depth prediction; in addition, we employ a multi-scale generator and discriminator to infer the features of occluded regions more accurately and to generate a fine-grained clothed 3D human body. Quantitative and qualitative evaluations show that our method not only adds finer detail to the final 3D mannequin for virtual fitting, but also significantly improves the user's try-on experience over previous studies.
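The core idea behind the multi-resolution parallel branch, keeping a full-resolution path alongside a downsampled path so that detail is not destroyed by pooling, can be illustrated with a toy Laplacian-style band split. This is a minimal NumPy sketch of the general principle, not the paper's actual network architecture; all function names here are hypothetical:

```python
import numpy as np

def avg_pool2(x):
    """Downsample a 2D feature map by 2x average pooling."""
    h, w = x.shape
    return x[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def multi_resolution_fuse(feat):
    """Run a coarse (low-frequency) branch in parallel with a residual
    high-frequency branch, then fuse them. Unlike a purely sequential
    encoder-decoder, the high-frequency band is carried through intact."""
    low = upsample2(avg_pool2(feat))  # coarse branch: large receptive field
    high = feat - low                 # residual branch: fine detail
    return low + high                 # fusion: detail survives exactly

feat = np.random.rand(8, 8)
fused = multi_resolution_fuse(feat)
# The band split is lossless, so the fused map equals the input,
# whereas pool-then-upsample alone would blur away the high band.
```

In a real network each branch would pass through its own convolutional layers before fusion; the point of the sketch is only that carrying a parallel full-resolution path preserves the high-frequency band that pooling discards.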
Data Availability
The data that support the findings of this study are available from the corresponding author, [author initials], upon reasonable request.
Contributions
All authors disclosed no relevant relationships. XH, CZ, JH, RL, JL, TP.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, X., Zheng, C., Huang, J. et al. Cloth texture preserving image-based 3D virtual try-on. Vis Comput 39, 3347–3357 (2023). https://doi.org/10.1007/s00371-023-02999-4