Diffusion-Based Semantic Image Synthesis from Sparse Layouts

Huang, Yuantian; Iizuka, Satoshi; Fukui, Kazuhiro

doi:10.1007/978-3-031-50072-5_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14496)

Included in the following conference series:

Computer Graphics International Conference

Abstract

We present an efficient framework that utilizes diffusion models to generate landscape images from sparse semantic layouts. Previous approaches use dense semantic label maps to generate images, where the quality of the results is highly dependent on the accuracy of the input semantic layouts. However, it is not trivial to create detailed and accurate semantic layouts in practice. To address this challenge, we carefully design a random masking process that effectively simulates real user input during the model training phase, making it more practical for real-world applications. Our framework leverages the Semantic Diffusion Model (SDM) as a generator to create full landscape images from sparse label maps, which are created randomly during the random masking process. Missing semantic information is complemented based on the learned image structure. Furthermore, we achieve comparable inference speed to GAN-based models through a model distillation process while preserving the generation quality. After training with the well-designed random masking process, our proposed framework is able to generate high-quality landscape images with sparse and intuitive inputs, which is useful for practical applications. Experiments show that our proposed method outperforms existing approaches both quantitatively and qualitatively. Code is available at https://github.com/sky24h/SIS_from_Sparse_Layouts.
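As a rough illustration of the idea of deriving sparse label maps from dense ones during training, the following is a minimal sketch and not the authors' masking procedure, which the abstract does not specify. It assumes a dense label map stored as a 2D integer array and hides random square blocks behind a placeholder id. The names `random_sparse_layout`, `UNLABELED`, `keep_prob`, and `block` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical id meaning "no label given here"; not taken from the paper.
UNLABELED = 255


def random_sparse_layout(dense, keep_prob=0.5, block=8, rng=None):
    """Return a copy of `dense` in which random square blocks are unlabeled.

    This block-wise dropout is an assumed stand-in for a random masking
    process; it only illustrates turning a dense map into a sparse one.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = dense.shape
    # One keep/drop decision per block (ceil division covers the borders).
    grid_h, grid_w = -(-h // block), -(-w // block)
    keep_blocks = rng.random((grid_h, grid_w)) < keep_prob
    # Expand block decisions to pixel resolution and crop to the map size.
    keep = np.repeat(np.repeat(keep_blocks, block, axis=0), block, axis=1)
    keep = keep[:h, :w]
    # Kept pixels retain their label; all others become UNLABELED.
    return np.where(keep, dense, UNLABELED).astype(dense.dtype)


# Toy dense layout: class 0 in the upper half, class 3 in the lower half.
dense = np.zeros((20, 30), dtype=np.uint8)
dense[10:] = 3
sparse = random_sparse_layout(dense, rng=np.random.default_rng(0))
```

In a training loop, a fresh sparse map would be drawn for every sample so that the generator sees many different partial layouts of the same scene. A masking scheme closer to real user input would likely keep connected strokes or regions rather than independent blocks.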

This study was supported by the Japan Science and Technology Agency Support for Pioneering Research Initiated by the Next Generation (JST SPRING); Grant Number JPMJSP2124.

Author information

Authors and Affiliations

University of Tsukuba, Ibaraki, 305-8577, Japan
Yuantian Huang, Satoshi Iizuka & Kazuhiro Fukui

Corresponding author

Correspondence to Yuantian Huang.

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Bin Sheng
Shanghai Jiao Tong University, Shanghai, China
Lei Bi
University of Sydney, Sydney, NSW, Australia
Jinman Kim
MIRALab-CUI, University of Geneve, Carouge, Geneve, Switzerland
Nadia Magnenat-Thalmann
Swiss Federal Institute of Technology, Lausanne, Switzerland
Daniel Thalmann

Cite this paper

Huang, Y., Iizuka, S., Fukui, K. (2024). Diffusion-Based Semantic Image Synthesis from Sparse Layouts. In: Sheng, B., Bi, L., Kim, J., Magnenat-Thalmann, N., Thalmann, D. (eds) Advances in Computer Graphics. CGI 2023. Lecture Notes in Computer Science, vol 14496. Springer, Cham. https://doi.org/10.1007/978-3-031-50072-5_35

DOI: https://doi.org/10.1007/978-3-031-50072-5_35
Published: 29 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50071-8
Online ISBN: 978-3-031-50072-5
eBook Packages: Computer Science, Computer Science (R0)
