DiffuStereo: High Quality Human Reconstruction via Diffusion-Based Stereo Using Sparse Cameras

Shao, Ruizhi; Zheng, Zerong; Zhang, Hongwen; Sun, Jingxiang; Liu, Yebin

doi:10.1007/978-3-031-19824-3_41

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13692)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We propose DiffuStereo, a novel system using only sparse cameras (8 in this work) for high-quality 3D human reconstruction. At its core is a novel diffusion-based stereo module, which introduces diffusion models, a powerful class of generative models, into the iterative stereo matching network. To this end, we design a new diffusion kernel and additional stereo constraints to facilitate stereo matching and depth estimation in the network. We further present a multi-level stereo network architecture to handle high-resolution (up to 4K) inputs without requiring an unaffordable memory footprint. Given a set of sparse-view color images of a human, the proposed multi-level diffusion-based stereo network can produce highly accurate depth maps, which are then converted into a high-quality 3D human model through an efficient multi-view fusion strategy. Overall, our method enables automatic reconstruction of human models with quality on par with high-end dense-view camera rigs, using a much more lightweight hardware setup. Experiments show that our method outperforms state-of-the-art methods by a large margin both qualitatively and quantitatively.
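
To make the two generic ingredients mentioned in the abstract concrete, here is a minimal sketch of (a) the standard DDPM forward noising step applied to a disparity-like map and (b) the textbook disparity-to-depth conversion for a rectified pinhole stereo pair. This is not the paper's method: the abstract states that DiffuStereo uses a new diffusion kernel and extra stereo constraints, which are not reproduced here. The function names, the linear noise schedule, and all numeric values are illustrative assumptions.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used default)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) for the standard Gaussian DDPM kernel.

    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-8):
    """Depth = focal * baseline / disparity for a rectified stereo pair."""
    return focal_px * baseline_m / np.maximum(disparity, eps)


# Hypothetical usage: noise a constant toy "disparity map" at step t = 500.
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step
rng = np.random.default_rng(0)
x0 = np.full((4, 4), 0.5)
x_t, noise = forward_diffuse(x0, 500, alpha_bar, rng)

# Hypothetical camera: 1000 px focal length, 0.1 m baseline.
depth = disparity_to_depth(np.array([10.0, 20.0]), focal_px=1000.0, baseline_m=0.1)
```

A denoising network would be trained to undo the noising step iteratively; how that network is conditioned on stereo features is specific to the paper and is left out of this sketch.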


Acknowledgements

This paper is supported by the National Key R&D Program of China (2021ZD0113501) and NSFC projects No. 62125107 and No. 61827805.

Author information

Authors and Affiliations

Tsinghua University, Beijing, China
Ruizhi Shao, Zerong Zheng, Hongwen Zhang, Jingxiang Sun & Yebin Liu

Corresponding author

Correspondence to Yebin Liu.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4780 KB)

About this paper

Cite this paper

Shao, R., Zheng, Z., Zhang, H., Sun, J., Liu, Y. (2022). DiffuStereo: High Quality Human Reconstruction via Diffusion-Based Stereo Using Sparse Cameras. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_41

DOI: https://doi.org/10.1007/978-3-031-19824-3_41
Published: 11 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19823-6
Online ISBN: 978-3-031-19824-3
eBook Packages: Computer Science, Computer Science (R0)
