RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)


Image generation from scene description is a cornerstone technique for the controlled generation, which is beneficial to applications such as content creation and image editing. In this work, we aim to synthesize images from scene description with retrieved patches as reference. We propose a differentiable retrieval module. With the differentiable retrieval module, we can (1) make the entire pipeline end-to-end trainable, enabling the learning of better feature embedding for retrieval; (2) encourage the selection of mutually compatible patches with additional objective functions. We conduct extensive quantitative and qualitative experiments to demonstrate that the proposed method can generate realistic and diverse images, where the retrieved patches are reasonable and mutually compatible.



This work is supported in part by the NSF CAREER Grant #1149783.

Supplementary material

504445_1_En_15_MOESM1_ESM.pdf (1 mb)
Supplementary material 1 (pdf 1068 KB)


  1. 1.
    Ak, K.E., Kassim, A.A., Hwee Lim, J., Yew Tham, J.: Learning attribute representations with localization for flexible fashion search. In: CVPR (2018)Google Scholar
  2. 2.
    Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: thing and stuff classes in context. In: CVPR (2018)Google Scholar
  3. 3.
    Chen, Y., Gong, S., Bazzani, L.: Image search with text feedback by visiolinguistic attention learning. In: CVPR (2020)Google Scholar
  4. 4.
    Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: StarGAN v2: diverse image synthesis for multiple domains. In: CVPR (2020)Google Scholar
  5. 5.
    Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014)Google Scholar
  6. 6.
    Han, X., et al.: Automatic spatially-aware fashion concept discovery. In: ICCV (2017)Google Scholar
  7. 7.
    Hays, J., Efros, A.A.: IM2GPS: estimating geographic information from a single image. In: CVPR (2008)Google Scholar
  8. 8.
    Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
  9. 9.
    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)Google Scholar
  10. 10.
    Huang, H.-P., Tseng, H.-Y., Lee, H.-Y., Huang, J.-B.: Semantic view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 592–608. Springer, Cham (2020). Scholar
  11. 11.
    Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. In: ICLR (2017)Google Scholar
  12. 12.
    Jiang, L., Meng, D., Mitamura, T., Hauptmann, A.G.: Easy samples first: self-paced reranking for zero-example multimedia search. In: ACM MM (2014)Google Scholar
  13. 13.
    Johnson, J., Gupta, A., Fei-Fei, L.: Image generation from scene graphs. In: CVPR (2018)Google Scholar
  14. 14.
    Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: CVPR (2015)Google Scholar
  15. 15.
    Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)Google Scholar
  16. 16.
    Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)MathSciNetCrossRefGoogle Scholar
  17. 17.
    Lee, H.Y., et al.: Drit++: diverse image-to-image translation via disentangled representations. IJCV 128, 2402–2417 (2020). Scholar
  18. 18.
    Lee, H.Y., Yang, W., Jiang, L., Le, M., Essa, I., Gong, H., Yang, M.H.: Neural design network: Graphic layout generation with constraints. In: ECCV. Springer, Heidelberg (2020)Google Scholar
  19. 19.
    Lee, H.Y., et al.: Dancing to music. In: NeurIPS (2019)Google Scholar
  20. 20.
    Li, W., et al.: Object-driven text-to-image synthesis via adversarial training. In: CVPR (2019)Google Scholar
  21. 21.
    Li, Y., Jiang, L., Yang, M.H.: Controllable and progressive image extrapolation. arXiv preprint arXiv:1912.11711 (2019)
  22. 22.
    Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: CVPR (2016)Google Scholar
  23. 23.
    Mao, Q., Lee, H.Y., Tseng, H.Y., Ma, S., Yang, M.H.: Mode seeking generative adversarial networks for diverse image synthesis. In: CVPR (2019)Google Scholar
  24. 24.
    Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)Google Scholar
  25. 25.
    Qi, X., Chen, Q., Jia, J., Koltun, V.: Semi-parametric image synthesis. In: CVPR (2018)Google Scholar
  26. 26.
    Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NeurIPS (2016)Google Scholar
  27. 27.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)Google Scholar
  28. 28.
    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)Google Scholar
  29. 29.
    Tan, F., Feng, S., Ordonez, V.: Text2Scene: generating compositional scenes from textual descriptions. In: CVPR (2019)Google Scholar
  30. 30.
    Tseng, H.Y., Fisher, M., Lu, J., Li, Y., Kim, V., Yang, M.H.: Modeling artistic workflows for image generation and editing. In: ECCV. Springer, Heidelberg (2020)Google Scholar
  31. 31.
    Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 494–509. Springer, Cham (2016). Scholar
  32. 32.
    Vo, N., et al.: Composing text and image for image retrieval-an empirical odyssey. In: CVPR (2019)Google Scholar
  33. 33.
    Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. In: CVPR (2018)Google Scholar
  34. 34.
    Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. In: ICCV (2017)Google Scholar
  35. 35.
    Xie, S.M., Ermon, S.: Reparameterizable subset sampling via continuous relaxations. In: IJCAI (2019)Google Scholar
  36. 36.
    Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR (2018)Google Scholar
  37. 37.
    Yikang, L., Ma, T., Bai, Y., Duan, N., Wei, S., Wang, X.: PasteGAN: a semi-parametric method to generate image from scene graph. In: NeurIPS (2019)Google Scholar
  38. 38.
    Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV (2017)Google Scholar
  39. 39.
    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)Google Scholar
  40. 40.
    Zhao, B., Meng, L., Yin, W., Sigal, L.: Image generation from layout. In: CVPR (2019)Google Scholar
  41. 41.
    Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networkss. In: ICCV (2017)Google Scholar
  42. 42.
    Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: ICCV (2019)Google Scholar
  43. 43.
    Zhu, P., Abdal, R., Qin, Y., Wonka, P.: SEAN: image synthesis with semantic region-adaptive normalization. In: CVPR (2020)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Google ResearchMountain ViewUSA
  2. 2.University of CaliforniaMercedUSA
  3. 3.Yonsei UniversitySeoulSouth Korea

Personalised recommendations