Controllable Image Synthesis via SegVAE

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12352)


Flexible user controls are desirable for content creation and image editing. A semantic map is a commonly used intermediate representation for conditional image generation. Compared with operating on raw RGB pixels, a semantic map is simpler for users to modify. In this work, we specifically target generating semantic maps given a label-set consisting of the desired categories. The proposed framework, SegVAE, synthesizes semantic maps iteratively using a conditional variational autoencoder. Quantitative and qualitative experiments demonstrate that the proposed model can generate realistic and diverse semantic maps. To better understand the quality of the synthesized semantic maps, we also apply an off-the-shelf image-to-image translation model to generate realistic RGB images from them. Finally, we showcase several real-world image-editing applications, including object removal, insertion, and replacement.
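The iterative, label-set-conditioned generation described above can be illustrated with a toy sketch. This is not the SegVAE implementation: the names `toy_decoder` and `generate_semantic_map`, the 4-dimensional latent, the rectangular masks, and the paste-over compositing rule are all assumptions made for illustration, standing in for a learned conditional decoder. Only the control flow (one category at a time, each step sampling a latent and conditioning on the label and the canvas so far) reflects the abstract.

```python
import math
import random


def toy_decoder(z, label, canvas, size):
    """Hypothetical stand-in for a learned conditional decoder.

    A real model would use `label` and `canvas` as conditioning inputs;
    this toy ignores them and maps the latent z to a rectangular binary mask.
    """
    # Squash each latent value into (0, 1) so it can define a rectangle.
    u = [1.0 / (1.0 + math.exp(-v)) for v in z]
    half = size // 2
    top, left = int(u[0] * half), int(u[1] * half)
    height = 1 + int(u[2] * (half - 1))
    width = 1 + int(u[3] * (half - 1))
    return [
        [1 if top <= r < top + height and left <= c < left + width else 0
         for c in range(size)]
        for r in range(size)
    ]


def generate_semantic_map(label_set, size=8, seed=0, background=0):
    """Build a semantic map one category at a time (illustrative only)."""
    rng = random.Random(seed)
    canvas = [[background] * size for _ in range(size)]
    for label in label_set:
        # Sample a latent code from a standard normal prior (an assumption).
        z = [rng.gauss(0.0, 1.0) for _ in range(4)]
        mask = toy_decoder(z, label, canvas, size)
        # Composite: paste the new category's mask onto the canvas.
        for r in range(size):
            for c in range(size):
                if mask[r][c]:
                    canvas[r][c] = label
    return canvas


if __name__ == "__main__":
    for row in generate_semantic_map([1, 2, 3]):
        print(" ".join(str(v) for v in row))
```

Changing `seed` yields a different layout for the same label-set, which mirrors, in a very loose sense, how sampling different latent codes produces diverse semantic maps.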



This work is supported in part by the NSF CAREER Grant #1149783, MOST 108-2634-F-007-016-, and MOST 109-2634-F-007-016-.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of California, Merced, Merced, USA
  2. National Tsing Hua University, Hsinchu City, Taiwan
  3. Google Research, Mountain View, USA
