
Cross-modal 3D Shape Generation and Manipulation

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13663)

Abstract

Creating and editing the shape and color of 3D objects requires tremendous human effort and expertise. Compared to direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually much more natural and intuitive for users. In this paper, we propose a generic multi-modal generative model that couples 2D modalities and implicit 3D representations through shared latent spaces. With the proposed model, versatile 3D generation and manipulation are enabled by simply propagating an edit from a specific 2D controlling modality through the latent spaces. For example, one can edit a 3D shape by drawing a sketch, re-colorize a 3D surface by painting color scribbles on its 2D rendering, or generate 3D shapes of a certain category from one or a few reference images. Unlike prior works, our model requires no re-training or fine-tuning per editing task; it is also conceptually simple, easy to implement, robust to input domain shifts, and flexible enough to support diverse reconstructions from partial 2D inputs. We evaluate our framework on two representative 2D modalities, grayscale line sketches and rendered color images, and demonstrate that our method enables various shape manipulation and generation tasks with these 2D modalities.

This work was mainly done while the first author was an intern at Snap Inc. Code and data are available at https://people.cs.umass.edu/~zezhoucheng/edit3d/.
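To make the shared-latent mechanism in the abstract concrete, below is a minimal, hypothetical sketch of how a 2D sketch edit could be propagated to an implicit 3D shape through a shared latent code. This is not the authors' implementation (the released code at the URL above is authoritative); the module names, network sizes, and the DeepSDF-style decoder are illustrative assumptions.

```python
# Hypothetical sketch of shared-latent cross-modal editing (illustrative only).
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed size of the shared latent space

class SketchGenerator(nn.Module):
    """Assumed 2D modality head: renders a 64x64 grayscale sketch from the latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, 64 * 64), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 64, 64)

class SDFDecoder(nn.Module):
    """Assumed 3D head: DeepSDF-style decoder mapping (latent, xyz) to a signed distance."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, xyz):
        # z: (B, LATENT_DIM), xyz: (B, N, 3) -> sdf: (B, N)
        z = z.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([z, xyz], dim=-1)).squeeze(-1)

def propagate_sketch_edit(z_init, edited_sketch, sketch_gen, steps=200, lr=1e-2):
    """Fit the shared latent so its sketch rendering matches the user's edit.
    Because the same code drives the 3D decoder, the 2D edit propagates to the
    implicit shape without any per-task re-training."""
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(sketch_gen(z), edited_sketch)
        loss.backward()
        opt.step()
    return z.detach()

# Usage (shapes assumed): after editing, query the 3D decoder with the new code.
#   z_edited = propagate_sketch_edit(z_init, edited_sketch, SketchGenerator())
#   sdf = SDFDecoder()(z_edited, query_points)  # query_points: (B, N, 3)
```

Under this scheme, generation from one or a few reference images would work the same way: fit a latent code to the image modality, then decode geometry (and, with a color head, appearance) from that same code.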



Acknowledgements

Subhransu Maji acknowledges support from NSF grants #1749833 and #1908669. Our experiments were partially performed on the University of Massachusetts GPU cluster funded by the Mass. Technology Collaborative.

Author information

Correspondence to Menglei Chai.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 2985 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Cheng, Z. et al. (2022). Cross-modal 3D Shape Generation and Manipulation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_18


  • DOI: https://doi.org/10.1007/978-3-031-20062-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20061-8

  • Online ISBN: 978-3-031-20062-5

  • eBook Packages: Computer Science, Computer Science (R0)
