Abstract
Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem, which synthesizes intermediate results between two domains, has not been well studied in the literature. Generating a smooth sequence of intermediate results bridges the gap between two different domains and enables morphing effects across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector, which enables continuous translation along diverse mapping paths across various domains. In particular, we introduce a unified attribute space shared by all domains that utilizes the sign operation to encode the domain information, thereby allowing interpolation between attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetric attribute vectors and leverage the domain information of the interpolated results along the trajectory for adversarial training. We evaluate the proposed method on a wide range of I2I translation tasks. Both qualitative and quantitative results demonstrate that the proposed framework generates higher-quality continuous translation results than state-of-the-art methods.
References
Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4432–4441.
Abdal, R., Qin, Y., & Wonka, P. (2020). Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8296–8305.
Anoosheh, A., Agustsson, E., Timofte, R., & Van Gool, L. (2018). Combogan: Unrestrained scalability for image domain translation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 783–790.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. In International conference on learning representations.
Burkov, E., Pasechnik, I., Grigorev, A., & Lempitsky, V. (2020). Neural head reenactment with latent pose descriptors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13786–13795.
Chen, Y. C., Lin, Y. Y., Yang, M. H., & Huang, J. B. (2019). Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1791–1800.
Chen, Y. C., Xu, X., Tian, Z., & Jia, J. (2019). Homomorphic latent space interpolation for unpaired image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2408–2416.
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797.
Choi, Y., Uh, Y., Yoo, J., & Ha, J. W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188–8197.
Gong, R., Li, W., Chen, Y., & Gool, L. V. (2019). Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2477–2486.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6629–6640.
Hoffman, J., Tzeng, E., Park, T., Zhu, J. Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1989–1998). PMLR.
Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pp. 1501–1510.
Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189.
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711). Springer.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of gans for improved quality, stability, and variation. In International conference on learning representations.
Kim, T., Cha, M., Kim, H., Lee, J. K., & Kim, J. (2017). Learning to discover cross-domain relations with generative adversarial networks. In International conference on machine learning (pp. 1857–1865). PMLR.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Kotovenko, D., Sanakoyeu, A., Lang, S., & Ommer, B. (2019). Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4422–4431.
Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., & Ranzato, M. (2017). Fader networks: Manipulating images by sliding attributes. Advances in Neural Information Processing Systems, 30, 5967–5976.
Lee, H. Y., Tseng, H. Y., Huang, J. B., Singh, M., & Yang, M. H. (2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision, (pp. 35–51). Springer.
Lee, H. Y., Tseng, H. Y., Mao, Q., Huang, J. B., Lu, Y. D., Singh, M., & Yang, M. H. (2020). Drit++: Diverse image-to-image translation via disentangled representations. International Journal of Computer Vision, 128(10), 2402–2417.
Liao, J., Lima, R. S., Nehab, D., Hoppe, H., Sander, P. V., & Yu, J. (2014). Automating image morphing using structural similarity on a halfway domain. ACM Transactions on Graphics (TOG), 33(5), 1–12.
Lira, W., Merz, J., Ritchie, D., Cohen-Or, D., & Zhang, H. (2020). Ganhopper: Multi-hop gan for unsupervised image-to-image translation. In European conference on computer vision, pp. 363–379. Springer.
Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708.
Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., & Kautz, J. (2019). Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10551–10560.
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In International conference on machine learning (Vol. 30, p. 3). Citeseer.
Mao, Q., Lee, H. Y., Tseng, H. Y., Ma, S., & Yang, M. H. (2019). Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1429–1437.
Mescheder, L., Geiger, A., & Nowozin, S. (2018). Which training methods for gans do actually converge? In International conference on machine learning (pp. 3481–3490). PMLR.
Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2337–2346.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch. In Advances in neural information processing systems workshops.
Saito, K., Saenko, K., & Liu, M. Y. (2020). Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part III (Vol. 16, pp. 382–398). Springer.
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9243–9252.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022
Voynov, A., & Babenko, A. (2020). Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning (pp. 9786–9796). PMLR.
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807.
Wolberg, G. (1998). Image morphing: A survey. The Visual Computer, 14(8), 360–372.
Wu, P. W., Lin, Y. J., Chang, C. H., Chang, E. Y., & Liao, S. W. (2019). Relgan: Multi-domain image-to-image translation via relative attributes. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5914–5922.
Wu, W., Cao, K., Li, C., Qian, C., & Loy, C.C. (2019). Transgaga: Geometry-aware unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8012–8021.
Yu, X., Chen, Y., Liu, S., Li, T., & Li, G. (2019). Multi-mapping image-to-image translation via learning disentanglement. Advances in Neural Information Processing Systems, 32, 2994–3004.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595.
Zhao, S., Song, J., & Ermon, S. (2017). Infovae: Information maximizing variational autoencoders. arXiv:1706.02262
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2223–2232.
Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017). Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476.
Acknowledgements
Q. Mao and S. Ma are supported in part by the National Natural Science Foundation of China (62025101), China Scholarship Council for 1 year visiting at the University of California at Merced, and High-performance Computing Platform of Peking University, State Key Laboratory of Media Convergence and Communication (Communication University of China), and Fundamental Research Funds for the Central Universities. H.-Y. Lee, H.-Y. Tseng and M.-H. Yang are supported in part by NSF CAREER grant 1149783.
Additional information
Communicated by Maja Pantic.
Appendices
Appendix A: Additional Experiments
A.1 More Continuous Translation Results
We present more diverse continuous translation paths from the source domain to the target domain in this section. We can continuously translate an input image \(I_s\) in the source domain to multiple images \(I_{t_1}, I_{t_2}, \dots , I_{t_N}\) in the target domain; we refer to this setting as one input and various targets. The target attribute vector can be obtained either by extracting it from a reference image sampled from the target domain (reference-guided) or by randomly generating an SAV of the target domain (latent-guided).
Reference-Guided Continuous Translation Figures 13, 14, 15, and 16 present more reference-guided continuous translation in both style translation and shape-variation tasks.
Latent-Guided Continuous Translation Figures 17 and 18 show more latent-guided continuous translation results in both style translation and shape-variation tasks, using randomly sampled signed attribute vectors of target domains.
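As a rough illustration of the latent-guided setting, the sketch below samples a random attribute magnitude and imposes the target domain's sign pattern via the sign operation. The function name `sample_sav` and the fixed ±1 domain pattern are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sample_sav(dim, domain_sign, rng=None):
    """Randomly generate a signed attribute vector (SAV) for a target domain.

    `domain_sign` is a vector of +1/-1 entries standing in for the domain
    encoding; the magnitudes are sampled from a standard normal. This is a
    simplified sketch of latent-guided SAV sampling.
    """
    if rng is None:
        rng = np.random.default_rng()
    magnitude = np.abs(rng.standard_normal(dim))
    return magnitude * np.asarray(domain_sign, dtype=float)

# Example: a hypothetical 8-dim space where the first half is positive
# for the target domain and the second half is negative.
sav = sample_sav(8, [1, 1, 1, 1, -1, -1, -1, -1], np.random.default_rng(0))
```

Because the magnitude is always non-negative, the sign pattern of the sampled vector matches the domain encoding by construction.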
A.2 Multiple Translation Paths Using Diverse Intermediate Points
Given a source image \(I_s\) and a target image \(I_t\), we can obtain multiple continuous translation paths by passing through different intermediate attribute vectors; we refer to this setting as one input, one target, and diverse intermediate points. In particular, we interpolate through multiple intermediate points between a source attribute vector and a target attribute vector to generate multiple translation paths, as presented in Fig. 19.
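The multi-path construction above can be sketched as a piecewise-linear trajectory through a chosen intermediate attribute vector; varying the intermediate point yields a different path between the same endpoints. The helper name `interpolation_path` and the uniform step schedule are illustrative assumptions.

```python
import numpy as np

def interpolation_path(a_src, a_tgt, a_mid, steps=8):
    """Piecewise-linear trajectory a_src -> a_mid -> a_tgt.

    Returns a (2 * steps, dim) array of interpolated attribute vectors;
    each row would be decoded into one frame of the translation sequence.
    """
    # First segment: source to intermediate (excludes a_mid to avoid a repeat).
    seg1 = [(1 - t) * a_src + t * a_mid
            for t in np.linspace(0.0, 1.0, steps, endpoint=False)]
    # Second segment: intermediate to target (includes both endpoints).
    seg2 = [(1 - t) * a_mid + t * a_tgt
            for t in np.linspace(0.0, 1.0, steps)]
    return np.stack(seg1 + seg2)

a_src, a_tgt = np.zeros(4), np.ones(4)
path = interpolation_path(a_src, a_tgt, a_mid=np.full(4, 0.2), steps=8)
```

Sampling several distinct `a_mid` vectors and decoding each resulting path gives the diverse translation sequences shown in Fig. 19.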
A.3 More Results on Facial Expression Continuous Translation
We present more results of facial expression continuous translation on the CelebA-HQ dataset. Figures 20 and 21 show the reference-guided and latent-guided continuous translation results, respectively.
A.4 More Comparisons with the State of the Art
We present more quantitative and qualitative comparison results in Figs. 22 and 23.
A.5 More Ablation Studies
We conduct additional ablation studies on other loss objectives commonly used for image-to-image translation in Fig. 24.
Cycle-Consistency Loss Both quantitative and qualitative results demonstrate that \({\mathcal {L}}_{1}^{\mathrm {cc}}\) helps preserve the domain-invariant characteristics of generated images. Without \({\mathcal {L}}_{1}^{\mathrm {cc}}\), the model yields the highest DIPD value at \(\beta = 1.0\), as illustrated in the “DIPD vs. \(\beta \)” curve of Fig. 24b. Figure 24a also shows that the expression of the generated image changes from neutral to smiling when \(\beta >0.4\).
Self-Reconstruction Loss The model trained without \({\mathcal {L}}_{1}^{\mathrm {recon}}\) cannot reconstruct the original image well, as shown in Fig. 24a at \(\beta = 0\). Accordingly, the “DIPD vs. \(\beta \)” curve of Fig. 24b shows that this model obtains the highest DIPD value at \(\beta = 0\).
Content Adversarial Loss The “FID vs. \(\beta \)” curve of Fig. 24b shows that the full model has better FID values than the one without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\). We also observe that the full model aligns content representations of two domains better than that trained without \({\mathcal {L}}_{\mathrm {adv}}^{\mathrm {content}}\), as shown in Fig. 25. Thus, the content discriminator helps align the distribution of the content representations of two domains and further disentangles the content and attribute representations.
Latent Regression and Mode Seeking Constraints To better illustrate the effectiveness of \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\), we calculate the LPIPS score of generated images on the male \(\rightarrow \) female translation for diversity comparisons. As shown in Table 3, both \({\mathcal {L}}_{1}^{\mathrm {latent}}\) and \({\mathcal {L}}_{\mathrm {ms}}\) help improve the diversity of generated images. We also observe that the model trained without \({\mathcal {L}}_{\mathrm {ms}}\) cannot capture the style of the reference image in the translated image, as shown in Fig. 24a. The “FID vs. \(\beta \)” curve of Fig. 24b further indicates that training with \({\mathcal {L}}_{\mathrm {ms}}\) enhances the quality of generated images.
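For reference, the mode-seeking regularization of Mao et al. (2019) encourages images generated from different latent codes to differ, by maximizing the ratio of image distance to latent distance; in practice this is implemented by minimizing the inverse ratio. A minimal sketch, with illustrative L1 distances standing in for the distance metrics:

```python
import numpy as np

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-8):
    """Inverse-ratio form of the mode-seeking regularizer.

    Minimizing d(z1, z2) / d(G(z1), G(z2)) pushes the generator to map
    distant latent codes to visually distinct images, mitigating mode
    collapse. L1 means are a simplified stand-in for the paper's metrics.
    """
    d_img = np.abs(img1 - img2).mean()
    d_z = np.abs(z1 - z2).mean()
    return d_z / (d_img + eps)

z1, z2 = np.zeros(4), np.ones(4)
# Two fake generator outputs: nearly identical vs. clearly different.
loss_similar = mode_seeking_loss(np.zeros((3, 4, 4)), np.full((3, 4, 4), 0.1), z1, z2)
loss_diverse = mode_seeking_loss(np.zeros((3, 4, 4)), np.ones((3, 4, 4)), z1, z2)
```

The loss is larger when the two outputs are nearly identical, so gradient descent drives the generator toward more diverse outputs.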
Appendix B: Network Architecture
Table 4 shows the network configuration details where Conv(k, s, p) and DeConv(k, s, p) denote the convolutional layer and transposed convolutional layer with k as kernel size, s as stride, and p as padding; DownResBlock and UpResBlock adopt the average pooling for down-sampling and the nearest-neighbor interpolation for up-sampling respectively; LN is the layer normalization (Ba et al. 2016) and IN is the instance normalization (Ulyanov et al. 2016); AdaIN is the adaptive instance normalization (Huang and Belongie 2017); and LReLU indicates leaky ReLU (Maas et al. 2013) with a negative slope of 0.2.
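For concreteness, the AdaIN operation (Huang and Belongie 2017) and the leaky ReLU with slope 0.2 used throughout Table 4 can be sketched as below. The NumPy formulation is a simplified stand-in for the actual PyTorch layers; in the network, the per-channel style statistics would be predicted from the attribute vector.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization for a (C, H, W) feature map.

    Each channel is normalized to zero mean and unit std over its spatial
    dimensions, then rescaled and shifted by per-channel style statistics.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return (style_std.reshape(-1, 1, 1) * normalized
            + style_mean.reshape(-1, 1, 1))

def lrelu(x, slope=0.2):
    """Leaky ReLU with negative slope 0.2, as listed in Table 4."""
    return np.where(x > 0, x, slope * x)

x = np.random.default_rng(0).standard_normal((3, 8, 8))
out = adain(x, style_mean=np.array([1.0, 2.0, 3.0]),
            style_std=np.array([0.5, 0.5, 0.5]))
```

After AdaIN, each output channel's spatial mean matches the requested style mean, which is how the attribute code controls the decoder's feature statistics.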
Appendix C: The Video
We present translation videos at https://helenmao.github.io/SAVI2I/. The video shows examples of generating diverse animations from a source input image by continuously translating it to the target domains, covering both reference-guided and latent-guided continuous translation with the proposed method.
Cite this article
Mao, Q., Tseng, HY., Lee, HY. et al. Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors. Int J Comput Vis 130, 517–549 (2022). https://doi.org/10.1007/s11263-021-01557-6