Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors

Abstract

Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem, which synthesizes intermediate results between two domains, has not been well studied in the literature. Generating a smooth sequence of intermediate results bridges the gap between two different domains and facilitates the morphing effect across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector (SAV), which enables continuous translation along diverse mapping paths across various domains. In particular, we introduce a unified attribute space shared by all domains that utilizes the sign operation to encode the domain information, thereby allowing interpolation between attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetrical attribute vectors and leverage the domain information of the interpolated results along the trajectory for adversarial training. We evaluate the proposed method on a wide range of I2I translation tasks. Both qualitative and quantitative results demonstrate that the proposed framework generates higher-quality continuous translation results than state-of-the-art methods.

References

  • Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4432–4441.

  • Abdal, R., Qin, Y., & Wonka, P. (2020). Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8296–8305.

  • Anoosheh, A., Agustsson, E., Timofte, R., & Van Gool, L. (2018). Combogan: Unrestrained scalability for image domain translation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 783–790.

  • Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450

  • Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. In International conference on learning representations.

  • Burkov, E., Pasechnik, I., Grigorev, A., & Lempitsky, V. (2020). Neural head reenactment with latent pose descriptors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13786–13795.

  • Chen, Y. C., Lin, Y. Y., Yang, M. H., & Huang, J. B. (2019). Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1791–1800.

  • Chen, Y. C., Xu, X., Tian, Z., & Jia, J. (2019). Homomorphic latent space interpolation for unpaired image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2408–2416.

  • Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797.

  • Choi, Y., Uh, Y., Yoo, J., & Ha, J. W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188–8197.

  • Gong, R., Li, W., Chen, Y., & Gool, L. V. (2019). Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2477–2486.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6629–6640.

  • Hoffman, J., Tzeng, E., Park, T., Zhu, J. Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1989–1998). PMLR.

  • Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pp. 1501–1510.

  • Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189.

  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.

  • Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711). Springer.

  • Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation. In International conference on learning representations.

  • Kim, T., Cha, M., Kim, H., Lee, J. K., & Kim, J. (2017). Learning to discover cross-domain relations with generative adversarial networks. In International conference on machine learning (pp. 1857–1865). PMLR.

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.

  • Kotovenko, D., Sanakoyeu, A., Lang, S., & Ommer, B. (2019). Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4422–4431.

  • Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., & Ranzato, M. (2017). Fader networks: Manipulating images by sliding attributes. Advances in Neural Information Processing Systems, 30, 5967–5976.

  • Lee, H. Y., Tseng, H. Y., Huang, J. B., Singh, M., & Yang, M. H. (2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (pp. 35–51). Springer.

  • Lee, H. Y., Tseng, H. Y., Mao, Q., Huang, J. B., Lu, Y. D., Singh, M., & Yang, M. H. (2020). Drit++: Diverse image-to-image translation via disentangled representations. International Journal of Computer Vision, 128(10), 2402–2417.

  • Liao, J., Lima, R. S., Nehab, D., Hoppe, H., Sander, P. V., & Yu, J. (2014). Automating image morphing using structural similarity on a halfway domain. ACM Transactions on Graphics (TOG), 33(5), 1–12.

  • Lira, W., Merz, J., Ritchie, D., Cohen-Or, D., & Zhang, H. (2020). Ganhopper: Multi-hop gan for unsupervised image-to-image translation. In European conference on computer vision, pp. 363–379. Springer.

  • Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708.

  • Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., & Kautz, J. (2019). Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10551–10560.

  • Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In International conference on machine learning (Vol. 30, p. 3). Citeseer.

  • Mao, Q., Lee, H. Y., Tseng, H. Y., Ma, S., & Yang, M. H. (2019). Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1429–1437.

  • Mescheder, L., Geiger, A., & Nowozin, S. (2018). Which training methods for gans do actually converge? In International conference on machine learning (pp. 3481–3490). PMLR.

  • Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2337–2346.

  • Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch. In Advances in neural information processing systems workshops.

  • Saito, K., Saenko, K., & Liu, M. Y. (2020). Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part III (Vol. 16, pp. 382–398). Springer.

  • Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9243–9252.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.

  • Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022

  • Voynov, A., & Babenko, A. (2020). Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning (pp. 9786–9796). PMLR.

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807.

  • Wolberg, G. (1998). Image morphing: A survey. The Visual Computer, 14(8), 360–372.

  • Wu, P. W., Lin, Y. J., Chang, C. H., Chang, E. Y., & Liao, S. W. (2019). Relgan: Multi-domain image-to-image translation via relative attributes. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5914–5922.

  • Wu, W., Cao, K., Li, C., Qian, C., & Loy, C.C. (2019). Transgaga: Geometry-aware unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8012–8021.

  • Yu, X., Chen, Y., Liu, S., Li, T., & Li, G. (2019). Multi-mapping image-to-image translation via learning disentanglement. Advances in Neural Information Processing Systems, 32, 2994–3004.

  • Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595.

  • Zhao, S., Song, J., & Ermon, S. (2017). Infovae: Information maximizing variational autoencoders. arXiv:1706.02262

  • Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2223–2232.

  • Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017). Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476.

Acknowledgements

Q. Mao and S. Ma are supported in part by the National Natural Science Foundation of China (62025101), the China Scholarship Council (supporting a one-year visit to the University of California at Merced), the High-performance Computing Platform of Peking University, the State Key Laboratory of Media Convergence and Communication (Communication University of China), and the Fundamental Research Funds for the Central Universities. H.-Y. Lee, H.-Y. Tseng, and M.-H. Yang are supported in part by NSF CAREER Grant 1149783.

Author information

Corresponding author

Correspondence to Ming-Hsuan Yang.

Additional information

Communicated by Maja Pantic.

Appendices

Appendix A: Additional Experiments

A.1 More Continuous Translation Results

In this section, we present more diverse continuous translation paths from the source domain to the target domain. We can continuously translate an input image \(I_s\) in the source domain to multiple images \(I_{t_1}, I_{t_2}, \dots , I_{t_N}\) in the target domain; we refer to this setting as one input and various targets. The target attribute vector can be obtained either by extracting it from a reference image sampled from the target domain (reference-guided) or by randomly generating an SAV of the target domain (latent-guided).
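
The following is a minimal sketch of the two settings. The module names (content_encoder, attr_encoder, generator) are hypothetical placeholders rather than the released SAVI2I interfaces, and the handling of the domain-encoding sign of the SAV is simplified; the intent is only to show how the target SAV is obtained and how intermediate frames follow from interpolated attribute vectors.

```python
import torch

def continuous_translation(content_encoder, attr_encoder, generator,
                           src_img, ref_img=None, attr_dim=8, steps=10):
    """Sketch of reference- or latent-guided continuous translation.

    Simplified: module names are placeholders and the sign handling of
    the SAV is omitted.
    """
    content = content_encoder(src_img)        # domain-invariant content code
    attr_src = attr_encoder(src_img)          # source signed attribute vector
    if ref_img is not None:
        attr_tgt = attr_encoder(ref_img)      # reference-guided target SAV
    else:
        attr_tgt = torch.randn(src_img.size(0), attr_dim)  # latent-guided target SAV

    frames = []
    for beta in torch.linspace(0.0, 1.0, steps):
        attr = (1.0 - beta) * attr_src + beta * attr_tgt   # interpolated attribute vector
        frames.append(generator(content, attr))
    return frames
```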

Reference-Guided Continuous Translation Figures 13, 14, 15, and 16 present more reference-guided continuous translation results on both style translation and shape-variation tasks.

Latent-Guided Continuous Translation Figures 17 and 18 show more latent-guided continuous translation results on both style translation and shape-variation tasks, using randomly sampled signed attribute vectors of the target domains.

Fig. 13 Reference-guided continuous translation results on the CelebA-HQ dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 14 Reference-guided continuous translation results on the AFHQ dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 15 Reference-guided continuous translation results on the Yosemite dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 16 Reference-guided continuous translation results on the Photo2Artwork dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 17 Latent-guided continuous translation results on the CelebA-HQ and AFHQ datasets. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 18 Latent-guided continuous translation results on the Photo2Artwork dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 19 One input, one target, and diverse intermediate points. The source image can be translated into one specific target through various intermediate points. We show three translation paths on the cat \(\rightarrow \) wildlife task with different intermediate points

Fig. 20 Reference-guided continuous expression translation on the CelebA-HQ dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 21 Latent-guided continuous expression translation on the CelebA-HQ dataset. Green and blue bounding boxes denote generated images of the source and target domain, respectively (Color figure online)

Fig. 22 Translation from female \(\rightarrow \) male images. a Translation based on reference images (top); translation based on interpolated domain labels or randomly sampled latent vectors (bottom). Green and blue bounding boxes denote generated images of the source and target domain, respectively. b From left to right, the y-axes of the sub-figures are the target domain translation ACC (the larger, the better), the target domain FID (the smaller, the better), the DIPD (the smaller, the better), and the LPIPS between two adjacent interpolated images (the smaller, the better). Each curve is plotted under different \(\beta \) values. Solid lines indicate methods using reference images; dashed lines denote approaches using interpolated domain labels or randomly sampled latent vectors of the target domain (Color figure online)

Fig. 23 Translation from cat \(\rightarrow \) dog images. a Translation based on reference images (top); translation based on interpolated domain labels or randomly sampled latent vectors (bottom). Green and blue bounding boxes denote generated images of the source and target domain, respectively. b From left to right, the y-axes of the sub-figures are the target domain translation ACC (the larger, the better), the target domain FID (the smaller, the better), the DIPD (the smaller, the better), and the LPIPS between two adjacent interpolated images (the smaller, the better). Each curve is plotted under different \(\beta \) values. Solid lines indicate methods using reference images; dashed lines denote approaches using interpolated domain labels or randomly sampled latent vectors of the target domain (Color figure online)

Fig. 24 Additional ablation studies on the male \(\rightarrow \) female translation. a Green and blue bounding boxes denote generated images of the source and target domain, respectively. b The x-axis is \(\beta \). From left to right, the y-axes of the sub-figures are the target domain translation ACC (the larger, the better), the target domain FID (the smaller, the better), the DIPD (the smaller, the better), and the LPIPS between two adjacent interpolated images (the smaller, the better) (Color figure online)

A.2 Multiple Translation Paths Using Diverse Intermediate Points

Given a source image \(I_s\) and a target image \(I_t\), we can obtain multiple continuous translation paths by passing through different intermediate attribute vectors; we call this setting one input, one target, and diverse intermediate points. In particular, we can interpolate between the source attribute vector and the target attribute vector through multiple intermediate points to generate multiple translation paths, as presented in Fig. 19 and sketched below.
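
As an illustration of this idea (a simplified sketch under assumed shapes, not the released implementation), a path can be built by piecewise-linear interpolation through a list of intermediate attribute vectors; different choices of intermediate points yield different paths between the same endpoints.

```python
import torch

def multi_point_path(attr_src, attr_tgt, intermediates, steps_per_segment=5):
    """Piecewise-linear interpolation of attribute vectors through
    user-chosen intermediate points (cf. Fig. 19)."""
    waypoints = [attr_src] + list(intermediates) + [attr_tgt]
    path = []
    for start, end in zip(waypoints[:-1], waypoints[1:]):
        for beta in torch.linspace(0.0, 1.0, steps_per_segment):
            path.append((1.0 - beta) * start + beta * end)
    return path

# Two different paths between the same source and target attribute vectors,
# obtained by perturbing the midpoint in opposite directions.
attr_src, attr_tgt = torch.randn(1, 8), torch.randn(1, 8)
midpoint = 0.5 * (attr_src + attr_tgt)
offset = 0.3 * torch.randn_like(midpoint)
path_a = multi_point_path(attr_src, attr_tgt, [midpoint + offset])
path_b = multi_point_path(attr_src, attr_tgt, [midpoint - offset])
```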

A.3 More Results on Facial Expression Continuous Translation

We present more results of facial expression continuous translation on the CelebA-HQ dataset. Figures 20 and 21 show the reference-guided and latent-guided continuous translation results, respectively.

A.4 More Comparisons with State-of-the-Art Methods

We present more quantitative and qualitative comparison results in Figs. 22 and 23.

A.5 More Ablation Studies

We conduct additional ablation studies, shown in Fig. 24, on other loss objectives commonly used for image-to-image translation.

Cycle-Consistency Loss Both quantitative and qualitative results demonstrate that \({\mathcal {L}}_{1}^{\mathrm {cc}}\) helps preserve the consistency of the domain-invariant characteristics of generated images. Without \({\mathcal {L}}_{1}^{\mathrm {cc}}\), the model yields the highest DIPD value when \(\beta = 1.0\), as illustrated in the "DIPD vs. \(\beta \)" curve of Fig. 24b. Figure 24a also shows that the expression of the generated image changes from neutral to smiling when \(\beta >0.4\).

Self-Reconstruction Loss The model trained without \({\mathcal {L}}_{1}^{\mathrm {recon}}\) cannot reconstruct the original image well, as shown in Fig. 24a at \(\beta = 0\). Accordingly, the "DIPD vs. \(\beta \)" curve of Fig. 24b shows that it obtains the highest DIPD value when \(\beta = 0\).
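
For concreteness, the two reconstruction terms discussed above can be sketched as follows under the usual content/attribute decomposition; the module names (G, E_c, E_a) are placeholders and the sketch does not reproduce the exact released implementation.

```python
import torch.nn.functional as F

def reconstruction_losses(G, E_c, E_a, x_src, x_tgt):
    """Sketch of the self-reconstruction and cycle-consistency L1 terms."""
    c_src, a_src = E_c(x_src), E_a(x_src)
    a_tgt = E_a(x_tgt)

    # Self-reconstruction: re-synthesize the input from its own codes.
    x_recon = G(c_src, a_src)
    loss_recon = F.l1_loss(x_recon, x_src)

    # Cycle consistency: translate to the target domain and back.
    x_fake = G(c_src, a_tgt)
    x_cyc = G(E_c(x_fake), a_src)
    loss_cc = F.l1_loss(x_cyc, x_src)

    return loss_recon, loss_cc
```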

Fig. 25 Comparisons of the content representations of two domains on the CelebA-HQ dataset using t-SNE. Each data point is a content representation encoded from an image of that domain

Content Adversarial Loss The "FID vs. \(\beta \)" curve of Fig. 24b shows that the full model has better FID values than the one without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\). We also observe that the full model aligns the content representations of the two domains better than the model trained without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\), as shown in Fig. 25. Thus, the content discriminator helps align the distributions of the content representations of the two domains and further disentangles the content and attribute representations.
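
A minimal sketch of one common way to implement such a content adversarial term (the 0.5-target formulation used in DRIT-style frameworks; function and module names are hypothetical, not the released code): the content discriminator learns to classify which domain a content code comes from, while the encoders are trained to make the two code distributions indistinguishable.

```python
import torch
import torch.nn.functional as F

def content_adversarial_losses(D_c, c_a, c_b):
    """Sketch: D_c predicts the domain of a content code; the discriminator
    uses hard labels, and the encoder loss pushes both domains toward the
    0.5 decision boundary."""
    # Discriminator step (content codes detached from the encoders).
    logits_a, logits_b = D_c(c_a.detach()), D_c(c_b.detach())
    d_loss = F.binary_cross_entropy_with_logits(logits_a, torch.ones_like(logits_a)) + \
             F.binary_cross_entropy_with_logits(logits_b, torch.zeros_like(logits_b))

    # Encoder step: make the content codes domain-indistinguishable.
    logits_a, logits_b = D_c(c_a), D_c(c_b)
    e_loss = F.binary_cross_entropy_with_logits(logits_a, torch.full_like(logits_a, 0.5)) + \
             F.binary_cross_entropy_with_logits(logits_b, torch.full_like(logits_b, 0.5))
    return d_loss, e_loss
```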

Latent Regression and Mode Seeking Constraints To better illustrate the effectiveness of \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\), we calculate the LPIPS score of generated images on the male \(\rightarrow \) female translation for diversity comparisons. As shown in Table 3, both \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\) help improve the diversity of generated images. In addition, we observe that the model trained without \({\mathcal {L}}_{\mathrm {ms}}\) cannot capture the style of the reference image in the translated image, as shown in Fig. 24a. The "FID vs. \(\beta \)" curve of Fig. 24b further indicates that training with \({\mathcal {L}}_{\mathrm {ms}}\) enhances the quality of generated images.
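
As a rough sketch of these two regularizers (placeholder names; the mode-seeking term follows the general formulation of Mao et al. 2019 rather than the exact released code), the latent regression term recovers the sampled attribute vector from the generated image, and the mode-seeking term encourages two different latent vectors to produce visibly different outputs.

```python
import torch

def latent_regression_loss(G, E_a, content, z):
    """L1 between the sampled attribute vector and the one recovered
    from the generated image."""
    z_rec = E_a(G(content, z))
    return torch.mean(torch.abs(z_rec - z))

def mode_seeking_loss(G, content, z1, z2, eps=1e-5):
    """Mode-seeking regularization (Mao et al. 2019): maximize the ratio
    of image distance to latent distance, implemented by minimizing its
    inverse."""
    img1, img2 = G(content, z1), G(content, z2)
    ratio = torch.mean(torch.abs(img1 - img2)) / (torch.mean(torch.abs(z1 - z2)) + eps)
    return 1.0 / (ratio + eps)
```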

Table 3 Diversity comparisons on the male \(\rightarrow \) female translation

Appendix B: Network Architecture

Table 4 shows the network configuration details, where Conv(k, s, p) and DeConv(k, s, p) denote the convolutional layer and the transposed convolutional layer with kernel size k, stride s, and padding p; DownResBlock and UpResBlock adopt average pooling for down-sampling and nearest-neighbor interpolation for up-sampling, respectively; LN is layer normalization (Ba et al. 2016) and IN is instance normalization (Ulyanov et al. 2016); AdaIN is adaptive instance normalization (Huang and Belongie 2017); and LReLU indicates leaky ReLU (Maas et al. 2013) with a negative slope of 0.2.

Table 4 Details of network architectures: (a) content encoder \(E_c\) architecture; (b) generator G architecture, where style translation tasks directly concatenate the attribute vector while shape-variation translation tasks adopt AdaIN to inject the attribute vector; (c) attribute encoder \(E_a\) and discriminator D architecture, where d is set to 8 and 1 for \(E_a\) and D, respectively; (d) content discriminator \(D_c\) architecture, where c is set to 256 and 512 for the style translation and shape-variation translation tasks, respectively, and shape-variation translation tasks do not contain the first DownResBlock; (e) fusing network F architecture
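
To illustrate how AdaIN injects an attribute vector into a feature map in the shape-variation generator, here is a minimal, self-contained sketch of a generic AdaIN layer under the usual formulation (not the exact SAVI2I layer; the affine-prediction layer and the (1 + gamma) parameterization are common choices rather than confirmed details).

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization (Huang and Belongie 2017): normalize
    the feature map per channel, then rescale and shift it with parameters
    predicted from the attribute vector."""
    def __init__(self, attr_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.fc = nn.Linear(attr_dim, num_features * 2)  # per-channel scale and shift

    def forward(self, x, attr):
        gamma, beta = self.fc(attr).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # shape (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Example: inject an 8-dimensional attribute vector into a 256-channel feature map.
layer = AdaIN(attr_dim=8, num_features=256)
features = torch.randn(2, 256, 16, 16)
attr = torch.randn(2, 8)
out = layer(features, attr)   # same shape as `features`
```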

Appendix C: The Video

We present translation videos at https://helenmao.github.io/SAVI2I/. The videos show examples of generating diverse animations from a source input image by continuously translating it to the target domains. Both reference-guided and latent-guided continuous translation results of the proposed method are included.

Cite this article

Mao, Q., Tseng, HY., Lee, HY. et al. Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors. Int J Comput Vis 130, 517–549 (2022). https://doi.org/10.1007/s11263-021-01557-6
