Abstract
3D-aware image synthesis has attained high quality and robust 3D consistency. However, existing 3D controllable generative models are designed to synthesize 3D-aware images from a single modality, such as a 2D segmentation map or a sketch, and lack fine-grained control over generated content, such as texture and age. In pursuit of enhanced user-guided controllability, we propose Multi3D, a 3D-aware controllable image synthesis model that supports multi-modal input. Our model governs the geometry of the generated image with a 2D label map, such as a segmentation or sketch map, while concurrently regulating its appearance through a textual description. To demonstrate the effectiveness of our method, we conducted experiments on multiple datasets, including CelebAMask-HQ, AFHQ-cat, and ShapeNet-car. Qualitative and quantitative evaluations show that our method outperforms existing state-of-the-art methods.
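The abstract describes conditioning a generator on two modalities at once: a 2D label map for geometry and a text prompt for appearance. The toy sketch below illustrates that fusion pattern only at the interface level; all function names, dimensions, and the random-projection "encoders" are hypothetical stand-ins (the paper's actual encoders and 3D-aware generator are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_label_map(label_map, num_classes=19, dim=64):
    """Pool a one-hot semantic map into a geometry code (hypothetical encoder)."""
    onehot = np.eye(num_classes)[label_map]      # (H, W, C) one-hot per pixel
    pooled = onehot.mean(axis=(0, 1))            # class-occupancy summary, (C,)
    W = rng.standard_normal((num_classes, dim))  # stand-in for learned weights
    return pooled @ W                            # geometry code, (dim,)

def encode_text(text, dim=64):
    """Stand-in for a CLIP-style text encoder: hash tokens to a fixed vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)  # unit-norm appearance code

def fuse(geometry_code, appearance_code):
    """Concatenate the two modality codes into one conditioning latent."""
    return np.concatenate([geometry_code, appearance_code])

label_map = rng.integers(0, 19, size=(32, 32))   # toy segmentation map
latent = fuse(encode_label_map(label_map),
              encode_text("a young woman with blond hair"))
print(latent.shape)                              # (128,)
```

In a real system the fused latent would condition a 3D-aware generator (e.g., a tri-plane or NeRF-based backbone), so that the label map constrains geometry while the text steers appearance.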
Acknowledgements
This paper was supported by the National Science and Technology Major Project (Grant No. 2021ZD0112902), the National Natural Science Foundation of China (Project No. 62220106003), a Research Grant from Beijing Higher Institution Engineering Research Center, and Tsinghua–Tencent Joint Laboratory for Internet Innovation Technology.
Author information
Contributions
Wenyang Zhou: Methodology, Experiment, Writing—Original Draft. Lu Yuan: Experiment, Writing—Original Draft. Taijiang Mu: Writing—Review and Editing, Supervision.
Ethics declarations
The authors declare that they have no competing interests relevant to the content of this article.
Additional information
Wenyang Zhou is currently a Ph.D. student in the Department of Computer Science and Technology, Tsinghua University. His research interests include computer graphics, 3D-aware generation, and computer vision.
Lu Yuan is currently a master's student at Stanford University. Her research interests include computer graphics and computer vision.
Taijiang Mu is currently a research assistant in the Department of Computer Science and Technology, Tsinghua University, where he received his bachelor's and doctoral degrees in 2011 and 2016, respectively. His research interests include computer graphics, visual media learning, 3D reconstruction, and 3D understanding.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhou, W., Yuan, L. & Mu, T. Multi3D: 3D-aware multimodal image synthesis. Comp. Visual Media (2024). https://doi.org/10.1007/s41095-024-0422-4