
End-to-End Visual Editing with a Generatively Pre-trained Artist

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

We consider the targeted image editing problem, namely blending a region in a source image with a driver image that specifies the desired change. Differently from prior works, we solve this problem by learning a conditional probability distribution of the edits, end-to-end in code space. Training such a model requires addressing the lack of example edits for training. To this end, we propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain. The benefits are remarkable: implemented as a state-of-the-art auto-regressive transformer, our approach is simple, sidesteps difficulties with previous methods based on GAN-like priors, obtains significantly better edits, and is efficient. Furthermore, we show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture. We demonstrate the superiority of this approach across several datasets in extensive quantitative and qualitative experiments, including human studies, significantly outperforming prior work.
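To make the recipe in the abstract concrete, the snippet below is a minimal sketch of how an edit could be simulated from a single unlabeled image, yielding a (source, driver, target) training triplet. The function name, the region handling, and the choice of augmentations are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the "simulate edits by augmenting off-the-shelf images"
# idea from the abstract. All names and augmentation choices are assumptions,
# not the paper's implementation.
import torch
import torchvision.transforms as T


def make_edit_triplet(image: torch.Tensor, region: tuple):
    """Turn one unlabeled image into a (source, driver, target) example.

    image  -- (3, H, W) float tensor in [0, 1] from the target domain
    region -- (top, left, height, width) box playing the role of the edit region
    """
    top, left, h, w = region

    # Target: the original image serves as the ground-truth edited result.
    target = image.clone()

    # Driver: the unmodified region, standing in for the user-provided example
    # of the desired appearance.
    driver = image[:, top:top + h, left:left + w].clone()

    # Source: the same image with the edit region corrupted, simulating content
    # the user would want to replace (assumed corruption: colour jitter + blur).
    corrupt = T.Compose([
        T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.4),
        T.GaussianBlur(kernel_size=9),
    ])
    source = image.clone()
    source[:, top:top + h, left:left + w] = corrupt(driver)

    return source, driver, target


# Training (not shown) would encode source, driver and target with a frozen
# VQGAN-style tokenizer and fit an autoregressive transformer to the
# conditional distribution p(target codes | source codes, driver codes) with a
# standard cross-entropy loss over the code sequence; at test time the same
# model is conditioned on a real source image and a user-chosen driver.
```

Because the target is always the original image, supervision of this kind comes for free from any image collection in the domain of interest.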


Notes

  1. After all, a picture is worth a thousand words!

  2. www.unsplash.com.


Acknowledgements

We are grateful for the advice and support of Yanping Xie, Antoine Toisoul, Thomas Hayes, and the EdiBERT authors.

Author information


Corresponding author

Correspondence to Andrew Brown.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7603 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Brown, A., Fu, C.Y., Parkhi, O., Berg, T.L., Vedaldi, A. (2022). End-to-End Visual Editing with a Generatively Pre-trained Artist. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_2


  • DOI: https://doi.org/10.1007/978-3-031-19784-0_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19783-3

  • Online ISBN: 978-3-031-19784-0

  • eBook Packages: Computer Science, Computer Science (R0)
