Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)

Abstract

We propose a novel algorithm, named Open-Edit, which is the first attempt at open-domain image manipulation with open-vocabulary instructions. The task is challenging given the large variation of image domains and the lack of training supervision. Our approach leverages a unified visual-semantic embedding space, pretrained on a general image-caption dataset, and manipulates the embedded visual features by applying text-guided vector arithmetic to the image feature maps. A structure-preserving image decoder then generates the manipulated images from the manipulated feature maps. We further propose an on-the-fly, sample-specific optimization with cycle-consistency constraints that regularizes the manipulated images and forces them to preserve details of the source images. Our approach shows promising results in manipulating open-vocabulary color, texture, and high-level attributes across various open-domain scenarios. (Code is released at https://github.com/xh-liu/Open-Edit.)
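To make the manipulation step concrete, the PyTorch sketch below illustrates one plausible additive form of text-guided vector arithmetic on an image feature map: every spatial location of the visual feature is shifted along the direction from a source-text embedding to a target-text embedding in the joint visual-semantic space. The function names, the embedding dimensionality, and the scalar `alpha` are illustrative assumptions for exposition, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of text-guided vector arithmetic on feature maps.
# Assumes image and text features already live in a shared embedding
# space with matching channel dimension C (as in the pretrained
# visual-semantic embedding the paper builds on).
import torch
import torch.nn.functional as F


def manipulate(visual_feats: torch.Tensor,
               src_emb: torch.Tensor,
               tgt_emb: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Shift each spatial location of the feature map along the
    source-to-target text direction.

    visual_feats: (B, C, H, W) image feature map in the joint space
    src_emb, tgt_emb: (B, C) embeddings of the source/target phrases,
        e.g. "green grass" -> "yellow grass"
    alpha: edit strength (hypothetical hyperparameter)
    """
    direction = F.normalize(tgt_emb - src_emb, dim=1)  # (B, C) unit vector
    direction = direction[:, :, None, None]            # broadcast over H, W
    return visual_feats + alpha * direction


if __name__ == "__main__":
    feats = torch.randn(1, 512, 8, 8)   # encoder output (shape assumed)
    src = torch.randn(1, 512)           # stand-in for the source-text embedding
    tgt = torch.randn(1, 512)           # stand-in for the target-text embedding
    edited = manipulate(feats, src, tgt, alpha=2.0)
    print(edited.shape)                 # torch.Size([1, 512, 8, 8])
```

Under this reading, the structure-preserving decoder would render `edited` back into pixels, and the on-the-fly sample-specific optimization would enforce cycle consistency, i.e. applying the reverse edit (target back to source) should reconstruct the original image.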


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The Chinese University of Hong Kong, Hong Kong, China
  2. Adobe Research, San Jose, USA