Text2LIVE: Text-Driven Layered Image and Video Editing

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13675)

Abstract

We present a method for zero-shot, text-driven editing of natural images and videos. Given an image or a video and a text prompt, our goal is to edit the appearance of existing objects (e.g., texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantic manner. We train a generator on an internal dataset, extracted from a single input, while leveraging an external pretrained CLIP model to impose our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the input. This allows us to control the generation and maintain high fidelity to the input via novel text-driven losses applied directly to the edit layer. Our method neither relies on a pretrained generator nor requires user-provided masks. We demonstrate localized, semantic edits on high-resolution images and videos across a variety of objects and scenes. Webpage: http://www.text2live.github.io.
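To make the layered-editing idea concrete, the following is a minimal PyTorch sketch of the compositing step and the CLIP-based supervision described above. All names here (the generator interface, clip_similarity_loss, the loss weighting) are our own illustrative assumptions, not the authors' implementation; the exact objectives are defined in the paper.

    import torch
    import torch.nn.functional as F

    def clip_similarity_loss(clip_model, image, text_features):
        # Resize to CLIP's expected 224x224 input, embed the image, and
        # penalize low cosine similarity with the target text embedding.
        image = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
        image_features = clip_model.encode_image(image)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return 1.0 - (image_features * text_features).sum(dim=-1).mean()

    def composite(input_rgb, edit_rgb, alpha):
        # Standard alpha compositing: the generated edit layer (color + opacity)
        # is blended over the original image or frame.
        return alpha * edit_rgb + (1.0 - alpha) * input_rgb

    def training_step(generator, clip_model, input_rgb, text_features):
        # The generator is trained only on data derived from the single input
        # and outputs an RGBA edit layer rather than the final edited image.
        rgba = generator(input_rgb)                      # (B, 4, H, W)
        edit_rgb = rgba[:, :3].sigmoid()
        alpha = rgba[:, 3:4].sigmoid()

        output = composite(input_rgb, edit_rgb, alpha)

        # Text-driven losses are imposed with a frozen, pretrained CLIP model,
        # both on the composited result and directly on the edit layer
        # (shown here simply as the alpha-weighted layer).
        loss_composite = clip_similarity_loss(clip_model, output, text_features)
        loss_layer = clip_similarity_loss(clip_model, alpha * edit_rgb, text_features)
        return loss_composite + loss_layer

Because the losses act on the edit layer as well as on the composite, the generator is encouraged to confine its changes to the layer, which is what preserves fidelity to the input.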

O. Bar-Tal, D. Ofri-Amar, and R. Fridman contributed equally.

Notes

  1. [5] works with \(224\times 224\) images, so we resize \(I_s\) and \(\alpha\) before applying loss (8).
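As a minimal illustration of this note, assuming \(I_s\) and \(\alpha\) are PyTorch tensors of shape (B, C, H, W), the resizing could look as follows (a hypothetical helper, not the authors' code):

    import torch.nn.functional as F

    def resize_for_relevancy(I_s, alpha, size=(224, 224)):
        # The relevancy model of [5] expects 224x224 inputs, so both the source
        # image I_s and the opacity map alpha are resized before computing loss (8).
        I_s_small = F.interpolate(I_s, size=size, mode="bilinear", align_corners=False)
        alpha_small = F.interpolate(alpha, size=size, mode="bilinear", align_corners=False)
        return I_s_small, alpha_small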

References

  1. Disco Diffusion. https://colab.research.google.com/github/alembics/disco-diffusion/blob/main/Disco_Diffusion.ipynb

  2. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  3. Bau, D., et al.: Paint by word. arXiv preprint arXiv:2103.10951 (2021)

  4. Brinkmann, R.: The Art and Science of Digital Compositing: Techniques for Visual Effects, Animation and Motion Graphics. Morgan Kaufmann, Burlington (2008)

  5. Chefer, H., Gur, S., Wolf, L.: Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  6. Crowson, K.: VQGAN+CLIP. https://colab.research.google.com/github/justinjohn0306/VQGAN-CLIP/blob/main/VQGAN%2BCLIP(Updated).ipynb

  7. Dong, H., Yu, S., Wu, C., Guo, Y.: Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International Conference on Computer Vision, ICCV (2017)

  8. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  9. Frans, K., Soros, L., Witkowski, O.: CLIPDraw: exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843 (2021)

  10. Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021)

  11. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  12. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  13. Jamriška, O., et al.: Stylizing video by example. ACM Trans. Graph. 38, 1–11 (2019)

  14. Jing, Y., Yang, Y., Feng, Z., Ye, J., Yu, Y., Song, M.: Neural style transfer: a review. IEEE Trans. Visual Comput. Graphics 26(11), 3365–3385 (2019)

  15. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  16. Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. ACM Trans. Graph. (TOG) 40(6), 1–12 (2021)

  17. Kim, G., Ye, J.C.: DiffusionCLIP: text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711 (2021)

  18. Kolkin, N.I., Salavon, J., Shakhnarovich, G.: Style transfer by relaxed optimal transport and self-similarity. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  19. Kwon, G., Ye, J.C.: CLIPstyler: image style transfer with a single text condition. arXiv preprint arXiv:2112.00374 (2021)

  20. Li, B., Qi, X., Lukasiewicz, T., Torr, P.H.: ManiGAN: text-guided image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  21. Lin, S., Fisher, M., Dai, A., Hanrahan, P.: LayerBuilder: layer decomposition for interactive image and video color editing. arXiv preprint arXiv:1701.03754 (2017)

  22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  23. Liu, X., Gong, C., Wu, L., Zhang, S., Su, H., Liu, Q.: FuseDream: training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573 (2021)

  24. Lu, E., et al.: Layered neural rendering for retiming people in video. ACM Trans. Graph. (2020)

  25. Lu, E., Cole, F., Dekel, T., Zisserman, A., Freeman, W.T., Rubinstein, M.: Omnimatte: associating objects and their effects in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  26. Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2Mesh: text-driven neural stylization for meshes. arXiv preprint arXiv:2112.03221 (2021)

  27. Nam, S., Kim, Y., Kim, S.J.: Text-adaptive generative adversarial networks: manipulating images with natural language. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)

  28. Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

  29. Park, T., et al.: Swapping autoencoder for deep image manipulation. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  30. Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: StyleCLIP: text-driven manipulation of StyleGAN imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  31. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)

  32. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)

  33. Ramesh, A., et al.: Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)

  34. Rav-Acha, A., Kohli, P., Rother, C., Fitzgibbon, A.W.: Unwrap mosaics: a new representation for video editing. ACM Trans. Graph. (2008)

  35. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)

  36. Richardson, E., et al.: Encoding in style: a StyleGAN encoder for image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  37. Ruder, M., Dosovitskiy, A., Brox, T.: Artistic style transfer for videos. In: Rosenhahn, B., Andres, B. (eds.) GCPR 2016. LNCS, vol. 9796, pp. 26–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45886-1_3

  38. Shaham, T.R., Dekel, T., Michaeli, T.: SinGAN: learning a generative model from a single natural image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  39. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2007)

  40. Shocher, A., Bagon, S., Isola, P., Irani, M.: InGAN: capturing and retargeting the “DNA” of a natural image. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  41. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  42. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: 9th International Conference on Learning Representations (ICLR) (2021)

  43. Texler, O., et al.: Interactive video stylization using few-shot patch-based training. ACM Trans. Graph. 39(4), 73:1 (2020)

  44. Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph. (TOG) 40(4), 1–14 (2021)

  45. Tumanyan, N., Bar-Tal, O., Bagon, S., Dekel, T.: Splicing ViT features for semantic appearance transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  46. Xia, W., Zhang, Y., Yang, Y., Xue, J.H., Zhou, B., Yang, M.H.: GAN inversion: a survey. arXiv preprint arXiv:2101.05278 (2021)

  47. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  48. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

  49. Zhang, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2019)

Acknowledgements

We thank Kfir Aberman, Lior Yariv, and Shai Bagon for reviewing early drafts, and Narek Tumanyan for assisting with the user evaluation. This project received funding from the Israeli Science Foundation (grant 2303/20).

Author information

Corresponding author

Correspondence to Omer Bar-Tal.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T. (2022). Text2LIVE: Text-Driven Layered Image and Video Editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham. https://doi.org/10.1007/978-3-031-19784-0_41

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19784-0_41

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19783-3

  • Online ISBN: 978-3-031-19784-0

  • eBook Packages: Computer Science, Computer Science (R0)
