Abstract
Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem, which synthesizes intermediate results between two domains, has not been well studied in the literature. Generating a smooth sequence of intermediate results bridges the gap between two different domains and enables morphing effects across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector, which enables continuous translation along diverse mapping paths across various domains. In particular, we introduce a unified attribute space shared by all domains that utilizes the sign operation to encode the domain information, thereby allowing interpolation between attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetric attribute vectors and leverage the domain information of the interpolated results along the trajectory for adversarial training. We evaluate the proposed method on a wide range of I2I translation tasks. Both qualitative and quantitative results demonstrate that the proposed framework generates higher-quality continuous translation results than state-of-the-art methods.
References
Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4432–4441.
Abdal, R., Qin, Y., & Wonka, P. (2020). Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8296–8305.
Anoosheh, A., Agustsson, E., Timofte, R., & Van Gool, L. (2018). Combogan: Unrestrained scalability for image domain translation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 783–790.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. In International conference on learning representations.
Burkov, E., Pasechnik, I., Grigorev, A., & Lempitsky, V. (2020). Neural head reenactment with latent pose descriptors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13786–13795.
Chen, Y. C., Lin, Y. Y., Yang, M. H., & Huang, J. B. (2019). Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1791–1800.
Chen, Y. C., Xu, X., Tian, Z., & Jia, J. (2019). Homomorphic latent space interpolation for unpaired image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2408–2416.
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797.
Choi, Y., Uh, Y., Yoo, J., & Ha, J. W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188–8197.
Gong, R., Li, W., Chen, Y., & Gool, L. V. (2019). Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2477–2486.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6629–6640.
Hoffman, J., Tzeng, E., Park, T., Zhu, J. Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1989–1998). PMLR.
Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pp. 1501–1510.
Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189.
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711). Springer.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation. In International conference on learning representations.
Kim, T., Cha, M., Kim, H., Lee, J. K., & Kim, J. (2017). Learning to discover cross-domain relations with generative adversarial networks. In International conference on machine learning (pp. 1857–1865). PMLR.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Kotovenko, D., Sanakoyeu, A., Lang, S., & Ommer, B. (2019). Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4422–4431.
Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., & Ranzato, M. (2017). Fader networks: Manipulating images by sliding attributes. Advances in Neural Information Processing Systems, 30, 5967–5976.
Lee, H. Y., Tseng, H. Y., Huang, J. B., Singh, M., & Yang, M. H. (2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision, (pp. 35–51). Springer.
Lee, H. Y., Tseng, H. Y., Mao, Q., Huang, J. B., Lu, Y. D., Singh, M., & Yang, M. H. (2020). Drit++: Diverse image-to-image translation via disentangled representations. International Journal of Computer Vision, 128(10), 2402–2417.
Liao, J., Lima, R. S., Nehab, D., Hoppe, H., Sander, P. V., & Yu, J. (2014). Automating image morphing using structural similarity on a halfway domain. ACM Transactions on Graphics (TOG), 33(5), 1–12.
Lira, W., Merz, J., Ritchie, D., Cohen-Or, D., & Zhang, H. (2020). Ganhopper: Multi-hop gan for unsupervised image-to-image translation. In European conference on computer vision, pp. 363–379. Springer.
Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708.
Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., & Kautz, J. (2019). Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10551–10560.
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In International conference on machine learning (Vol. 30, p. 3). Citeseer.
Mao, Q., Lee, H. Y., Tseng, H. Y., Ma, S., & Yang, M. H. (2019). Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1429–1437.
Mescheder, L., Geiger, A., & Nowozin, S. (2018). Which training methods for gans do actually converge? In International conference on machine learning (pp. 3481–3490). PMLR.
Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2337–2346.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch. In Advances in neural information processing systems workshops.
Saito, K., Saenko, K., & Liu, M. Y. (2020). Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part III (Vol. 16, pp. 382–398). Springer.
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9243–9252.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022
Voynov, A., & Babenko, A. (2020). Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning (pp. 9786–9796). PMLR.
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807.
Wolberg, G. (1998). Image morphing: A survey. The Visual Computer, 14(8), 360–372.
Wu, P. W., Lin, Y. J., Chang, C. H., Chang, E. Y., & Liao, S. W. (2019). Relgan: Multi-domain image-to-image translation via relative attributes. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5914–5922.
Wu, W., Cao, K., Li, C., Qian, C., & Loy, C.C. (2019). Transgaga: Geometry-aware unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8012–8021.
Yu, X., Chen, Y., Liu, S., Li, T., & Li, G. (2019). Multi-mapping image-to-image translation via learning disentanglement. Advances in Neural Information Processing Systems, 32, 2994–3004.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595.
Zhao, S., Song, J., & Ermon, S. (2017). Infovae: Information maximizing variational autoencoders. arXiv:1706.02262
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2223–2232.
Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017). Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476.
Acknowledgements
Q. Mao and S. Ma are supported in part by the National Natural Science Foundation of China (62025101), China Scholarship Council for 1 year visiting at the University of California at Merced, and High-performance Computing Platform of Peking University, State Key Laboratory of Media Convergence and Communication (Communication University of China), and Fundamental Research Funds for the Central Universities. H.-Y. Lee, H.-Y. Tseng and M.-H. Yang are supported in part by NSF CAREER grant 1149783.
Additional information
Communicated by Maja Pantic.
Appendices
Appendix A: Additional Experiments
A.1 More Continuous Translation Results
We present more diverse continuous translation paths from the source domain to the target domain in this section. We can continuously translate an input image \(I_s\) in the source domain to multiple images \(I_{t_1}, I_{t_2}, \dots , I_{t_N}\) in the target domain; we refer to this setting as one input and various targets. The target attribute vector can be obtained either by extracting it from a reference image sampled from the target domain (reference-guided) or by randomly generating an SAV of the target domain (latent-guided).
Reference-Guided Continuous Translation Figures 13, 14, 15, and 16 present more reference-guided continuous translation in both style translation and shape-variation tasks.
Latent-Guided Continuous Translation Figures 17 and 18 show more latent-guided continuous translation results in both style translation and shape-variation tasks, using randomly sampled signed attribute vectors of target domains.
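As a rough illustration of the latent-guided setting, the sketch below samples a random attribute magnitude and imposes the target domain's sign pattern via the sign operation. The function name `sample_sav` and the fixed ±1 domain pattern are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sample_sav(dim, domain_sign, rng=None):
    """Randomly generate a signed attribute vector (SAV) for a target domain.

    `domain_sign` is a vector of +1/-1 entries standing in for the domain
    encoding; the magnitudes are sampled from a standard normal. This is a
    simplified sketch of latent-guided SAV sampling.
    """
    if rng is None:
        rng = np.random.default_rng()
    magnitude = np.abs(rng.standard_normal(dim))
    return magnitude * np.asarray(domain_sign, dtype=float)

# Example: a hypothetical 8-dim space where the first half is positive
# for the target domain and the second half is negative.
sav = sample_sav(8, [1, 1, 1, 1, -1, -1, -1, -1], np.random.default_rng(0))
```

Because the magnitude is always non-negative, the sign pattern of the sampled vector matches the domain encoding by construction.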
A.2 Multiple Translation Paths Using Diverse Intermediate Points
Given a source image \(I_s\) and a target image \(I_t\), we can obtain multiple continuous translation paths by passing through different intermediate attribute vectors; we refer to this setting as one input, one target, and diverse intermediate points. In particular, we interpolate through multiple intermediate points between a source attribute vector and a target attribute vector to generate multiple translation paths, as presented in Fig. 19.
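The multi-path construction above can be sketched as a piecewise-linear trajectory through a chosen intermediate attribute vector; varying the intermediate point yields a different path between the same endpoints. The helper name `interpolation_path` and the uniform step schedule are illustrative assumptions.

```python
import numpy as np

def interpolation_path(a_src, a_tgt, a_mid, steps=8):
    """Piecewise-linear trajectory a_src -> a_mid -> a_tgt.

    Returns a (2 * steps, dim) array of interpolated attribute vectors;
    each row would be decoded into one frame of the translation sequence.
    """
    # First segment: source to intermediate (excludes a_mid to avoid a repeat).
    seg1 = [(1 - t) * a_src + t * a_mid
            for t in np.linspace(0.0, 1.0, steps, endpoint=False)]
    # Second segment: intermediate to target (includes both endpoints).
    seg2 = [(1 - t) * a_mid + t * a_tgt
            for t in np.linspace(0.0, 1.0, steps)]
    return np.stack(seg1 + seg2)

a_src, a_tgt = np.zeros(4), np.ones(4)
path = interpolation_path(a_src, a_tgt, a_mid=np.full(4, 0.2), steps=8)
```

Sampling several distinct `a_mid` vectors and decoding each resulting path gives the diverse translation sequences shown in Fig. 19.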
A.3 More Results on Facial Expression Continuous Translation
We present more results of facial expression continuous translation on the CelebA-HQ dataset. Figures 20 and 21 show the reference-guided and latent-guided continuous translation results, respectively.
A.4 More Comparisons with the State of the Art
We present more quantitative and qualitative comparison results in Figs. 22 and 23.
A.5 More Ablation Studies
We conduct additional ablation studies on other loss objectives commonly used for image-to-image translation in Fig. 24.
Cycle-Consistency Loss Both quantitative and qualitative results demonstrate that \({\mathcal {L}}_{1}^{\mathrm {cc}}\) helps preserve the domain-invariant characteristics of generated images. Without \({\mathcal {L}}_{1}^{\mathrm {cc}}\), the model yields the highest DIPD value at \(\beta = 1.0\), as illustrated in the “DIPD vs. \(\beta \)” curve of Fig. 24b. Figure 24a also shows that the expression of the generated image changes from neutral to smiling when \(\beta >0.4\).
Self-Reconstruction Loss The model trained without \({\mathcal {L}}_{1}^{\mathrm {recon}}\) cannot reconstruct the original image well, as shown in Fig. 24a at \(\beta = 0\). Accordingly, the “DIPD vs. \(\beta \)” curve of Fig. 24b shows that this model obtains the highest DIPD value at \(\beta = 0\).
Content Adversarial Loss The “FID vs. \(\beta \)” curve of Fig. 24b shows that the full model has better FID values than the one without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\). We also observe that the full model aligns content representations of two domains better than that trained without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\), as shown in Fig. 25. Thus, the content discriminator helps align the distribution of the content representations of two domains and further disentangles the content and attribute representations.
Latent Regression and Mode Seeking Constraints To better illustrate the effectiveness of \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\), we calculate the LPIPS score of generated images on the male \(\rightarrow \) female translation for diversity comparisons. As shown in Table 3, both \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\) help improve the diversity of generated images. We also observe that the model trained without \({\mathcal {L}}_{\mathrm {ms}}\) cannot capture the style of the reference image in the translated image, as shown in Fig. 24a. The “FID vs. \(\beta \)” curve of Fig. 24b further indicates that training with \({\mathcal {L}}_{\mathrm {ms}}\) enhances the quality of generated images.
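For reference, the mode-seeking regularization of Mao et al. (2019) encourages images generated from different latent codes to differ, by maximizing the ratio of image distance to latent distance; in practice this is implemented by minimizing the inverse ratio. A minimal sketch, with illustrative L1 distances standing in for the distance metrics:

```python
import numpy as np

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-8):
    """Inverse-ratio form of the mode-seeking regularizer.

    Minimizing d(z1, z2) / d(G(z1), G(z2)) pushes the generator to map
    distant latent codes to visually distinct images, mitigating mode
    collapse. L1 means are a simplified stand-in for the paper's metrics.
    """
    d_img = np.abs(img1 - img2).mean()
    d_z = np.abs(z1 - z2).mean()
    return d_z / (d_img + eps)

z1, z2 = np.zeros(4), np.ones(4)
# Two fake generator outputs: nearly identical vs. clearly different.
loss_similar = mode_seeking_loss(np.zeros((3, 4, 4)), np.full((3, 4, 4), 0.1), z1, z2)
loss_diverse = mode_seeking_loss(np.zeros((3, 4, 4)), np.ones((3, 4, 4)), z1, z2)
```

The loss is larger when the two outputs are nearly identical, so gradient descent drives the generator toward more diverse outputs.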
Appendix B: Network Architecture
Table 4 shows the network configuration details where Conv(k, s, p) and DeConv(k, s, p) denote the convolutional layer and transposed convolutional layer with k as kernel size, s as stride, and p as padding; DownResBlock and UpResBlock adopt the average pooling for down-sampling and the nearest-neighbor interpolation for up-sampling respectively; LN is the layer normalization (Ba et al. 2016) and IN is the instance normalization (Ulyanov et al. 2016); AdaIN is the adaptive instance normalization (Huang and Belongie 2017); and LReLU indicates leaky ReLU (Maas et al. 2013) with a negative slope of 0.2.
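For concreteness, the AdaIN operation (Huang and Belongie 2017) and the leaky ReLU with slope 0.2 used throughout Table 4 can be sketched as below. The NumPy formulation is a simplified stand-in for the actual PyTorch layers; in the network, the per-channel style statistics would be predicted from the attribute vector.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization for a (C, H, W) feature map.

    Each channel is normalized to zero mean and unit std over its spatial
    dimensions, then rescaled and shifted by per-channel style statistics.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return (style_std.reshape(-1, 1, 1) * normalized
            + style_mean.reshape(-1, 1, 1))

def lrelu(x, slope=0.2):
    """Leaky ReLU with negative slope 0.2, as listed in Table 4."""
    return np.where(x > 0, x, slope * x)

x = np.random.default_rng(0).standard_normal((3, 8, 8))
out = adain(x, style_mean=np.array([1.0, 2.0, 3.0]),
            style_std=np.array([0.5, 0.5, 0.5]))
```

After AdaIN, each output channel's spatial mean matches the requested style mean, which is how the attribute code controls the decoder's feature statistics.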
Appendix C: The Video
We present translation videos at https://helenmao.github.io/SAVI2I/. The video shows examples of generating diverse animations from a source input image by continuously translating it to the target domains, covering both reference-guided and latent-guided continuous translation with the proposed method.
Cite this article
Mao, Q., Tseng, HY., Lee, HY. et al. Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors. Int J Comput Vis 130, 517–549 (2022). https://doi.org/10.1007/s11263-021-01557-6