Text2LIVE: Text-Driven Layered Image and Video Editing

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13675)

Abstract

We present a method for zero-shot, text-driven editing of natural images and videos. Given an image or a video and a text prompt, our goal is to edit the appearance of existing objects (e.g., texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantic manner. We train a generator on an internal dataset, extracted from a single input, while leveraging an external pretrained CLIP model to impose our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the input. This allows us to control the generation and maintain high fidelity to the input via novel text-driven losses applied directly to the edit layer. Our method neither relies on a pretrained generator nor requires user-provided masks. We demonstrate localized, semantic edits on high-resolution images and videos across a variety of objects and scenes. Webpage: http://www.text2live.github.io.
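The abstract's key idea — generating an edit layer of color and opacity that is composited over the input — is the standard "over" compositing operation. A minimal sketch in NumPy (the function name and array shapes here are illustrative, not taken from the paper's code):

```python
import numpy as np

def composite_edit_layer(base, color, alpha):
    """Alpha-composite a generated edit layer over the input image.

    base  -- (H, W, 3) input image, values in [0, 1]
    color -- (H, W, 3) RGB channels of the edit layer
    alpha -- (H, W, 1) opacity channel of the edit layer
    """
    return alpha * color + (1.0 - alpha) * base

# A fully transparent edit layer leaves the input untouched.
base = np.random.rand(8, 8, 3)
color = np.ones_like(base)
identity = composite_edit_layer(base, color, np.zeros((8, 8, 1)))
```

Because the edit lives entirely in the (color, alpha) layer, the base image passes through unchanged wherever alpha is zero, which is what lets the method constrain the edit and preserve fidelity to the input.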

O. Bar-Tal, D. Ofri-Amar, and R. Fridman contributed equally.

Notes

  1. [5] works with \(224\times 224\) images, so we resize \(I_s\) and \(\alpha\) before applying loss (8).
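The resizing described in this note can be sketched as follows. This is a nearest-neighbor stand-in for illustration only — the paper's implementation likely uses a differentiable bilinear resize, and the names here are hypothetical:

```python
import numpy as np

def resize_nearest(img, size=(224, 224)):
    """Nearest-neighbor resize of an (H, W, C) array to the given size."""
    h, w = img.shape[:2]
    rows = np.minimum(np.arange(size[0]) * h // size[0], h - 1)
    cols = np.minimum(np.arange(size[1]) * w // size[1], w - 1)
    return img[rows][:, cols]

# Bring a high-resolution frame and its opacity map down to the
# 224x224 resolution that the model of [5] expects.
I_s = np.random.rand(512, 768, 3)
alpha = np.random.rand(512, 768, 1)
I_s_small, alpha_small = resize_nearest(I_s), resize_nearest(alpha)
```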

References

  1. Disco Diffusion. https://colab.research.google.com/github/alembics/disco-diffusion/blob/main/Disco_Diffusion.ipynb

  2. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  3. Bau, D., et al.: Paint by word. arXiv preprint arXiv:2103.10951 (2021)

  4. Brinkmann, R.: The Art and Science of Digital Compositing: Techniques for Visual Effects, Animation and Motion Graphics. Morgan Kaufmann, Burlington (2008)

  5. Chefer, H., Gur, S., Wolf, L.: Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  6. Crowson, K.: VQGAN+CLIP. https://colab.research.google.com/github/justinjohn0306/VQGAN-CLIP/blob/main/VQGAN%2BCLIP(Updated).ipynb

  7. Dong, H., Yu, S., Wu, C., Guo, Y.: Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International Conference on Computer Vision, ICCV (2017)

  8. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  9. Frans, K., Soros, L., Witkowski, O.: CLIPDraw: exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843 (2021)

  10. Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021)

  11. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  12. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  13. Jamriška, O., et al.: Stylizing video by example. ACM Trans. Graph. 38, 1–11 (2019)

  14. Jing, Y., Yang, Y., Feng, Z., Ye, J., Yu, Y., Song, M.: Neural style transfer: a review. IEEE Trans. Visual Comput. Graphics 26(11), 3365–3385 (2019)

  15. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  16. Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. ACM Trans. Graph. (TOG) 40(6), 1–12 (2021)

  17. Kim, G., Ye, J.C.: DiffusionCLIP: text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711 (2021)

  18. Kolkin, N.I., Salavon, J., Shakhnarovich, G.: Style transfer by relaxed optimal transport and self-similarity. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  19. Kwon, G., Ye, J.C.: CLIPstyler: image style transfer with a single text condition. arXiv preprint arXiv:2112.00374 (2021)

  20. Li, B., Qi, X., Lukasiewicz, T., Torr, P.H.: ManiGAN: text-guided image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  21. Lin, S., Fisher, M., Dai, A., Hanrahan, P.: LayerBuilder: layer decomposition for interactive image and video color editing. arXiv preprint arXiv:1701.03754 (2017)

  22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  23. Liu, X., Gong, C., Wu, L., Zhang, S., Su, H., Liu, Q.: FuseDream: training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573 (2021)

  24. Lu, E., et al.: Layered neural rendering for retiming people in video. ACM Trans. Graph. (2020)

  25. Lu, E., Cole, F., Dekel, T., Zisserman, A., Freeman, W.T., Rubinstein, M.: Omnimatte: associating objects and their effects in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  26. Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2Mesh: text-driven neural stylization for meshes. arXiv preprint arXiv:2112.03221 (2021)

  27. Nam, S., Kim, Y., Kim, S.J.: Text-adaptive generative adversarial networks: manipulating images with natural language. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)

  28. Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

  29. Park, T., et al.: Swapping autoencoder for deep image manipulation. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  30. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: StyleCLIP: text-driven manipulation of StyleGAN imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  31. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)

  32. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)

  33. Ramesh, A., et al.: Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)

  34. Rav-Acha, A., Kohli, P., Rother, C., Fitzgibbon, A.W.: Unwrap mosaics: a new representation for video editing. ACM Trans. Graph. (2008)

  35. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)

  36. Richardson, E., et al.: Encoding in style: a StyleGAN encoder for image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  37. Ruder, M., Dosovitskiy, A., Brox, T.: Artistic style transfer for videos. In: Rosenhahn, B., Andres, B. (eds.) GCPR 2016. LNCS, vol. 9796, pp. 26–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45886-1_3

  38. Shaham, T.R., Dekel, T., Michaeli, T.: SinGAN: learning a generative model from a single natural image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  39. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2007)

  40. Shocher, A., Bagon, S., Isola, P., Irani, M.: InGAN: capturing and retargeting the “DNA” of a natural image. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  41. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  42. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: 9th International Conference on Learning Representations (ICLR) (2021)

  43. Texler, O., et al.: Interactive video stylization using few-shot patch-based training. ACM Trans. Graph. 39(4), 73:1 (2020)

  44. Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph. (TOG) 40(4), 1–14 (2021)

  45. Tumanyan, N., Bar-Tal, O., Bagon, S., Dekel, T.: Splicing ViT features for semantic appearance transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  46. Xia, W., Zhang, Y., Yang, Y., Xue, J.H., Zhou, B., Yang, M.H.: GAN inversion: a survey. arXiv preprint arXiv:2101.05278 (2021)

  47. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  48. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

  49. Zhang, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2019)

Acknowledgements

We thank Kfir Aberman, Lior Yariv, and Shai Bagon for reviewing early drafts, and Narek Tumanyan for assisting with the user evaluation. This project received funding from the Israeli Science Foundation (grant 2303/20).

Author information

Correspondence to Omer Bar-Tal.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T. (2022). Text2LIVE: Text-Driven Layered Image and Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_41

  • DOI: https://doi.org/10.1007/978-3-031-19784-0_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19783-3

  • Online ISBN: 978-3-031-19784-0

  • eBook Packages: Computer Science, Computer Science (R0)
