Abstract
We propose DiffuStereo, a novel system that uses only sparse cameras (8 in this work) for high-quality 3D human reconstruction. At its core is a novel diffusion-based stereo module, which introduces diffusion models, a powerful class of generative models, into the iterative stereo matching network. To this end, we design a new diffusion kernel and additional stereo constraints to facilitate stereo matching and depth estimation in the network. We further present a multi-level stereo network architecture that handles high-resolution (up to 4K) inputs without an unaffordable memory footprint. Given a set of sparse-view color images of a human, the proposed multi-level diffusion-based stereo network produces highly accurate depth maps, which are then converted into a high-quality 3D human model through an efficient multi-view fusion strategy. Overall, our method enables automatic reconstruction of human models with quality on par with high-end dense-view camera rigs, using a much more lightweight hardware setup. Experiments show that our method outperforms state-of-the-art methods by a large margin, both qualitatively and quantitatively.
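The depth estimation mentioned above builds on the standard relation between disparity and depth in a rectified stereo pair, depth = focal length × baseline / disparity. The following is a minimal sketch of that relation only, assuming a pinhole camera model; the function name and the example numbers are hypothetical, are not taken from the paper, and do not represent the diffusion-based network itself.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity (pixels) to depth (metres) for a rectified stereo pair.

    Assumes a pinhole model: depth = focal_px * baseline_m / disparity_px.
    A non-positive disparity is treated as a point at infinity.
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical values chosen for illustration only.
depth = disparity_to_depth(disparity_px=50.0, focal_px=1000.0, baseline_m=0.5)
print(depth)
```

Because depth is inversely proportional to disparity, small disparity errors translate into larger depth errors for distant surfaces, which is one reason accurate stereo matching matters for fine geometric detail.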
Acknowledgements
This paper is supported by the National Key R&D Program of China (2021ZD0113501) and NSFC projects No. 62125107 and No. 61827805.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shao, R., Zheng, Z., Zhang, H., Sun, J., Liu, Y. (2022). DiffuStereo: High Quality Human Reconstruction via Diffusion-Based Stereo Using Sparse Cameras. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_41
DOI: https://doi.org/10.1007/978-3-031-19824-3_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19823-6
Online ISBN: 978-3-031-19824-3
eBook Packages: Computer Science, Computer Science (R0)