
Translation of Real-World Photographs into Artistic Images via Conditional CycleGAN and StarGAN


To translate a real-world photograph into an artistic image in the style of a famous artist, the selection of colors and brushstrokes should reflect those of the artist. CycleGAN, a one-to-one domain translation architecture trained with an unpaired dataset, can be used to translate a real-world photograph into an artistic image. However, to translate images into N artistic styles, a separate CycleGAN must be trained for each style. Here, we develop a single deep learning architecture whose output can be controlled to yield multiple artistic styles by adding a conditional vector. The overall architecture includes a one-to-N domain translation architecture, namely, a conditional CycleGAN, and an N-to-N domain translation architecture, namely, StarGAN, for translating into five different artistic styles. An evaluation of the trained models reveals that multiple artistic styles can be produced from a single real-world photograph simply by adjusting the conditional input.


Computer vision is making rapid advances through the application of the latest deep learning techniques. Object detection [1], image classification [2], and face recognition [3] are the foremost sub-disciplines in computer vision. Another sub-discipline gaining momentum is image transformation, or image translation. The fundamental theme of an image translation algorithm is to learn to transform an input image into an output image based on a given criterion [4]. Noise filtering, for example, is a transformation task that deals with filtering electronic noise produced in the process of image acquisition, storage, and retrieval [5]. Image colorization deals with transforming a grayscale image into a full-color image [6]. Image super-resolution is the task of transforming the original image into a version with higher resolution and greater detail [7]. Image captioning is also a transformation task, wherein a deep learning algorithm learns to describe the main contents of an image in a text caption [8].

Style transfer is another upcoming research category in the image transformation discipline. It deals with the task of learning the style from one or more images and applying that style to a new image [9]. Converting facial photographs into portraits [10], caricatures [11], or Japanese anime [12], or converting artistic pictures by famous artists into photographic style, are some of the latest technologies that fall under the umbrella of image style translation. This study deals with the inverse transformation: it converts a given photograph into the style of a desired artist.

Well-known artists generally have a unique style that reflects their mastery of colors and brushstrokes. Reproducing such a style by hand is difficult, but a method based on a deep learning architecture may be suitable for this task.

Translating a real-world landscape photograph into an artistic image is a kind of one-to-one domain translation. Studies have applied one-to-one domain translation to segmentation [13,14,15], colorization [14, 16, 17], and image translation [18,19,20]. The latter references, in particular, conducted one-to-one domain translation training with an unpaired dataset.

CycleGAN [21] is a representative one-to-one domain translation architecture that employs cycle consistency loss and adversarial loss. One study [21] demonstrated the translation of photographs into images in the style of a specific artist (or genre). The translated images reflected the edge and color features of the original photograph. This accomplishment motivated us to develop an architecture for the translation of photographs into artistic images.

Different from the aforementioned CycleGAN approach, our study translates real-world photographs into images in N artistic styles via a single generator. To control the output of the conditional artistic image, we utilized a technique from a conditional generative adversarial network (cGAN) [22]. This architecture inserts a conditional input related to the attributes of the data. cGAN is widely adopted in text-to-image synthesis [23], detection of target regions [24, 25], and segmentation learning [14]. The N-domain translation architecture StarGAN [26] inserts a conditional input with a one-hot vector representation for face attribute translation. We investigated whether conditional domain translation learning with an unpaired dataset can be applied to the translation of a photograph into a conditional artistic image.

In our experiment, we constructed two translation architectures, namely, conditional CycleGAN and StarGAN. These models were designed for the translation mapping from a content image to five artistic styles (Monet, Cezanne, Renoir, Ukiyo-e, and van Gogh). CycleGAN is a one-to-one domain translation model that learns to mimic the features of the target domain from unpaired datasets; it takes into account the content dataset and a single target-domain dataset. StarGAN is an N-to-N domain translation model that learns from N domains and translates into N kinds of target images.

We changed the location of the insertion of the conditional vector at the encoder stage and compared the resulting quality. In the evaluation phase, we tested the visualization of the conditional input, conducted an output quantitative evaluation with the Fréchet inception distance (FID) score, and determined classification accuracy with fine-tuned VGGNet [27] and ResNet with 152 layers [28].

In the experiments, we obtained the following results:

  • The trained model with a conditional vector translated the entire space of the input images into artistic images in a unique style following the vector.

  • The conditional models that inserted the vector in the first layer tended to fail in the translation; they output almost unchanged images even though the value of the conditional vector was different.

  • The translation results are greatly affected by the insertion position of the vector that specifies the domain.

  • Conditional CycleGAN had FID scores similar to those of CycleGAN for each style and StarGAN had the best accuracy in terms of recognition with fine-tuned models.

Related Work

Neural Style Transfer

A convolutional neural network (CNN) is a deep neural network that outputs several feature maps from a given input image. It captures multiple features through convolution over the entire input image [29]. Because CNNs are good at extracting important features even when the inputs are distorted, they have been applied to image processing tasks such as image recognition (AlexNet [30]) and object detection (Faster R-CNN [31]). CNNs are also utilized in creative tasks such as super-resolution of face images [32], generating vegetation on terrain [33], and conditional generation of single characters from the Latin alphabet to katakana (the Japanese syllabary used for transcribing foreign words) [34].

The Neural Algorithm of Artistic Style [35,36,37] uses a trained VGGNet, a network that adopts deep CNN layers [27], to obtain content spatial information (content representation) and style texture information (style representation) in the middle layers. By repeating feed-forward and backpropagation, the algorithm optimizes an output image so that it has the shapes of the content image and the style of the style image.

Johnson et al. [38] used an image transformation network to generate artistic images and VGGNet to calculate two kinds of losses, namely, (1) the Euclidean distance between the feature representations of the content images and the generated images and (2) the style difference computed via a Gram matrix.
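The Gram-matrix style term can be sketched as follows (a minimal illustration; in [38] the feature maps come from intermediate VGGNet layers):

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a CNN feature map: channel-wise inner products that
    capture texture statistics while discarding spatial layout."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    # (n, c, c) matrix of channel correlations, normalized by feature size
    return f @ f.transpose(1, 2) / (c * h * w)
```

The style loss is then the distance between the Gram matrices of the generated image and the style image, summed over the chosen layers.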

As an extended application of this method [38], an architecture that performs multiple style transfers by concatenating the content image and the style image was developed [39].

Unpaired Image-to-Image Translation

The neural style transfer approach translates content images into artistic images using an image in a single style (or images in N styles). However, these artistic images consist of edge features from the content image, with color features replaced by those from the style image. The goal of our study is to mimic the style of a specific artist (or genre) based on the related artwork. Therefore, we also utilize the color features from the content images.

CycleGAN [21] is a representative image-to-image translation architecture that does not require paired images. This architecture aims to learn a translation mapping that maintains the features of the content image. CycleGAN calculates two kinds of losses: adversarial loss (patch loss [14]), obtained by discriminating whether each of the output patches is real (input images belong to the target translation dataset) or fake (input images are generated by a generator), and cycle consistency loss, obtained as an L1 distance. Input data \(x \in X\) are translated into \(G(x)\) through mapping function \(G:X\to Y\) and then translated back through mapping function \(F:Y\to X\). For cycle consistency loss, whether \(G(x)\) keeps the content information (i.e., whether \(x \approx F(G\left(x\right))\)) is used as the criterion. This penalty enables unpaired image learning and translation between separated domains, such as photographs and paintings.
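The two losses can be sketched in PyTorch as follows (a minimal illustration with toy callables standing in for the actual ResNet generators and PatchGAN discriminators):

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, x, y):
    """L1 penalty enforcing x ~ F(G(x)) and y ~ G(F(y))."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

def adversarial_loss(D_Y, G, x, y):
    """Standard GAN loss for the mapping G: X -> Y (patch scores averaged)."""
    return torch.log(D_Y(y)).mean() + torch.log(1.0 - D_Y(G(x))).mean()
```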

One-to-one domain translation can be conducted using architectures other than CycleGAN. UNIT [40] is a two-domain translation framework based on GANs and variational autoencoders that creates a shared latent space via encoders. U-GAT-IT [20] adopts attention feature maps and the AdaLIN (Adaptive Layer-Instance Normalization) function for the extreme domain translation application of a still image to an animation. NICE-GAN [41] reuses the hidden vector from the discriminator and inputs it to the generator.

These one-to-one domain translation architectures use unpaired images. However, to translate a photograph into an artistic image in a particular style (Monet, van Gogh, Ukiyo-e, etc.), a separate generator is necessary for each artistic style.

Conditional Image-to-Image Translation

To perform one-to-multiple artistic domain translation using a single CycleGAN, the conditioning technique of cGAN [22] can be applied.

In a previous study [22], a conditional vector that corresponds to the classification label was used as an additional input to the generator and the discriminator. The generator learns to generate realistic images following the conditional vector and the discriminator discriminates the input images as real or fake using the vector.

The one-to-N domain translation architecture called conditional CycleGAN inputs the conditional vector to the generator and the discriminator following the method of cGAN. A conditional vector \(c\) is concatenated with content data \(x\), and the cycle consistency loss is calculated as in CycleGAN; i.e., it is determined whether \(x \approx F(G\left(x,c\right))\), where \(c\in C\) is the conditional vector that corresponds to the specified category from the entire set of categories \(C\). Conditional CycleGAN has been applied to attribute-guided face generation [42] and the translation of food categories [43].
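The conditioning itself is simple: the one-hot vector is broadcast over the spatial grid and concatenated to the image as extra channels. A minimal sketch (the function name is ours):

```python
import torch

def concat_condition(x, c):
    """Broadcast a one-hot style vector c over the spatial grid and
    concatenate it to the image tensor x as extra channels."""
    n, _, h, w = x.shape
    c_map = c.view(n, -1, 1, 1).expand(n, c.size(1), h, w)
    return torch.cat([x, c_map], dim=1)

x = torch.randn(4, 3, 256, 256)               # batch of content photos
c = torch.eye(5)[torch.tensor([0, 2, 4, 1])]  # one-hot vectors for 5 styles
out = concat_condition(x, c)                  # shape: (4, 3 + 5, 256, 256)
```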

The N-to-N domain translation architecture called StarGAN [26] is capable of multi-domain translation learning between facial attributes by inputting the images and target attributes to the generator. Similar to StarGAN, RelGAN [44] replaces binary-value attributes with relative attributes that indicate the difference between source attributes and target attributes.

The translation target is what is common between images belonging to domains \(X\) and \(Y\). We translate photographs into multiple artistic styles by treating the entire image as the translation target.

Proposed Conditional Architecture

As a first step in developing the conditional translation of photographs into artistic images, we constructed and extended the representative domain translation models conditional CycleGAN, which adds a conditional vector to the generator and discriminator, and StarGAN. We compared the models while changing the insertion location of the conditional vector.

Conditional CycleGAN

Figure 1 shows an overview of the architecture of generators \(G\) and \(F\) and discriminators \({D}_{X}\) and \({D}_{Y}\). When translating photographs into artistic images, a conditional vector with a one-hot representation is input to \(G\) and \({D}_{Y}\). At \({D}_{Y}\), the discriminated images and the vector are concatenated and input to the first layer. In \(G\), the conditional input is inserted at one of three locations: (1) the first layer, (2) before the residual blocks (counted as the third layer), or (3) each layer of the encoder. Figure 2 shows the details of the architecture of our conditional CycleGAN. When translating artistic images back to photographs, we train \(F\) to generate realistic photographs and \({D}_{X}\) to discriminate between real and fake photographs without the conditional vector, as in CycleGAN.
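The three insertion locations can be illustrated with a toy encoder (channel widths here are illustrative, not the paper's; the real generator uses the ResNet-based encoder of [21]):

```python
import torch
import torch.nn as nn

class CondEncoder(nn.Module):
    """Toy encoder illustrating the insertion points compared in the paper:
    the condition map is concatenated before layer 1, before the residual
    blocks (counted as the third layer), or before every encoder layer."""
    def __init__(self, n_styles=5, insert_at=(2,)):
        super().__init__()
        self.insert_at = set(insert_at)  # 0-indexed layers that receive c
        chans = [3, 16, 32, 64]
        self.layers = nn.ModuleList()
        for i in range(3):
            in_ch = chans[i] + (n_styles if i in self.insert_at else 0)
            self.layers.append(nn.Conv2d(in_ch, chans[i + 1], 3, padding=1))

    def forward(self, x, c):
        for i, layer in enumerate(self.layers):
            if i in self.insert_at:
                n, _, h, w = x.shape
                cmap = c.view(n, -1, 1, 1).expand(n, c.size(1), h, w)
                x = torch.cat([x, cmap], dim=1)
            x = torch.relu(layer(x))
        return x
```

Changing `insert_at` to `(0,)` or `(0, 1, 2)` reproduces the first-layer and every-layer variants, respectively.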

Fig. 1

Overview of translation of proposed conditional CycleGAN

Fig. 2

Architecture of proposed conditional CycleGAN

The full objective loss function employed here is shown in Eq. (1), where \(c\in C\) denotes the latent target style vector with a one-hot representation. We set the adversarial loss and cycle consistency loss based on CycleGAN and added an identity mapping loss following a previous study [45]. This loss helps prevent \(G\) and \(F\) from overcorrecting the tint of the input images [identity mapping loss is given in Eq. (6)]:

$$L\left(G,F,{D}_{X},{D}_{Y}\right)={L}_{{\rm GAN}}\left(G,{D}_{Y},X,Y\right)$$
$$\qquad+{L}_{{\rm GAN}}\left(F,{D}_{X},X,Y\right)$$
$$\qquad+{\lambda }_{{\rm cyc}}{L}_{{\rm cyc}}\left(G,F\right)+{\lambda }_{{\rm idt}}{L}_{{\rm idt}}\left(G,F\right)$$
$${L}_{{\rm GAN}}\left(G,{D}_{Y},X,Y\right)={\mathbb{E}}_{y\sim {p}_{{\rm data}}\left(y\right)}\left[\log {D}_{Y}(y,c)\right]$$
$$\qquad+{\mathbb{E}}_{x\sim {p}_{{\rm data}}(x)}\left[\log \left(1-{D}_{Y}(G(x,c),c)\right)\right]$$
$${L}_{{\rm GAN}}\left(F,{D}_{X},X,Y\right)={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[\log {D}_{X}(x)\right]$$
$$\qquad+{\mathbb{E}}_{y\sim {p}_{{\rm data}}(y)}\left[\log \left(1-{D}_{X}(F(y))\right)\right]$$
$${L}_{{\rm cyc}}\left(G,F\right)={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[{\Vert F\left(G\left(x,c\right)\right)-x\Vert }_{1}\right]$$
$$\qquad+{\mathbb{E}}_{y\sim {p}_{{\rm data}}(y)}\left[{\Vert G\left(F\left(y\right),c\right)-y\Vert }_{1}\right]$$
$${L}_{{\rm idt}}\left(G,F\right)={\mathbb{E}}_{y\sim {p}_{{\rm data}}\left(y\right)}\left[{\Vert G\left(y,c\right)-y\Vert }_{1}\right]$$
$$\qquad+{\mathbb{E}}_{x\sim {p}_{{\rm data}}(x)}\left[{\Vert F\left(x\right)-x\Vert }_{1}\right]$$
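A sketch of the full objective assembled in code (toy callables; we assume \(D_Y\) takes the conditional vector alongside the image, while \(F\) and \(D_X\) take none):

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def full_objective(G, F, D_X, D_Y, x, y, c, lam_cyc=10.0, lam_idt=5.0):
    """Adversarial + cycle-consistency + identity terms of Eq. (1)."""
    gan_G = torch.log(D_Y(y, c)).mean() + torch.log(1 - D_Y(G(x, c), c)).mean()
    gan_F = torch.log(D_X(x)).mean() + torch.log(1 - D_X(F(y))).mean()
    cyc = l1(F(G(x, c)), x) + l1(G(F(y), c), y)
    idt = l1(G(y, c), y) + l1(F(x), x)
    return gan_G + gan_F + lam_cyc * cyc + lam_idt * idt
```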


StarGAN

We constructed the generator of this model based on [21] (although [26] used images of 128 × 128 pixels, we translated images of 256 × 256 pixels) and the discriminator based on [26]. In the original StarGAN, the conditional input is inserted in the first layer. In our study, we compared StarGAN variants, changing the insertion location of the conditional vector to determine its effect on image translation. Figure 3 shows an overview of the architecture of the proposed StarGAN. The full objective loss functions for the generator and discriminator are shown in Eqs. (6) and (7), respectively. In Eq. (7), we followed the technique in [26] that adds \({L}_{\rm gp}\), the gradient penalty of the Wasserstein GAN objective [46], for stable mapping learning. \({c}_{\rm src}, {c}_{\rm target}\in {C}^{{\prime}}\) denote latent domain vectors covering the N art-style domains and one content domain. Generator \(G\) learns conditional mapping following \({c}_{\rm target}\) and reconstructs images with \({c}_{\rm src}\). Discriminator \(D\) discriminates real from fake images over all patches and classifies the domain:

Fig. 3

Overview of translation of proposed StarGAN

$${L}_{G}={L}_{{\rm GAN}}^{G}+{\lambda }_{\rm cls}{L}_{\rm cls}^{G}+{\lambda }_{{\rm cyc}}{L}_{{\rm cyc}}+{\lambda }_{{\rm idt}}{L}_{{\rm idt}}$$
$${L}_{D}={L}_{{\rm GAN}}^{D}+{\lambda }_{\rm cls}{L}_{\rm cls}^{D}+{\lambda }_{\rm gp}{L}_{\rm gp}$$
$${L}_{{\rm GAN}}^{G}={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[\log D(G(x,{c}_{\rm target}))\right]$$
$${L}_{{\rm GAN}}^{D}={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[\log D\left(x\right)\right]$$
$$\qquad+{\mathbb{E}}_{x\sim {p}_{{\rm data}}(x)}\left[\log \left(1-D(G(x,{c}_{\rm target}))\right)\right]$$
$${L}_{\rm gp}={\mathbb{E}}_{\widehat{x}}\left[{\left({\Vert {\nabla }_{\widehat{x}}D(\widehat{x})\Vert }_{2}-1\right)}^{2}\right]$$
$${L}_{\rm cls}^{G}={\mathbb{E}}_{x\sim {p}_{{\rm data}}(x)}\left[-\log {D}_{\rm cls}({c}_{\rm target}|G(x,{c}_{\rm target}))\right]$$
$${L}_{\rm cls}^{D}={\mathbb{E}}_{y\sim {p}_{{\rm data}}(y)}\left[-\log {D}_{\rm cls}({c}_{\rm src}|y)\right]$$
$${L}_{{\rm cyc}}={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[{\Vert G\left(G\left(x,{c}_{\rm target}\right),{c}_{\rm src}\right)-x\Vert }_{1}\right]$$
$${L}_{{\rm idt}}={\mathbb{E}}_{x\sim {p}_{{\rm data}}\left(x\right)}\left[{\Vert G\left(x,{c}_{\rm src}\right)-x\Vert }_{1}\right]$$
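The gradient penalty term can be sketched as follows (a generic WGAN-GP implementation [46]; \(\widehat{x}\) is sampled uniformly on lines between real and generated samples):

```python
import torch

def gradient_penalty(D, real, fake):
    """WGAN-GP penalty: push the norm of D's gradient toward 1 at points
    interpolated between real and fake samples. D is any differentiable
    callable returning one score per sample."""
    n = real.size(0)
    eps = torch.rand(n, 1, 1, 1)                       # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = D(x_hat)
    grads = torch.autograd.grad(out.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.view(n, -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```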

Figure 4 shows the details of the architecture of our StarGAN. We compare the results obtained with three types of generators, namely, those with the conditional vector inserted in (1) the first layer, (2) before the residual block layer, and (3) each layer of the encoder.

Fig. 4

Architecture of proposed StarGAN


Data Set

Real-World Photograph Datasets

We used the real-world photograph dataset created by Zhu et al. [21]: 6287 images were used for training and 751 images were used for testing.

Artwork Dataset

We used the artwork dataset from the Kaggle competition “Painter by Numbers” [47]. Using the attached annotations, we extracted artwork by Claude Monet, Paul Cezanne, Pierre-Auguste Renoir, and Vincent van Gogh and works in the genre Ukiyo-e (we omitted sketches and nude paintings). We used 498 images (Claude Monet), 493 (Paul Cezanne), 462 (Pierre-Auguste Renoir), 374 (Vincent van Gogh), and 1350 (Ukiyo-e). In pre-processing, the content images and the style images were resized to 286 × 286 pixels and randomly cropped to 256 × 256 pixels. During testing, the images were resized to 256 × 256 pixels without any random cropping.

Learning Environment

We constructed and trained the models using PyTorch. All training was carried out on an NVIDIA GeForce RTX 2080 Ti GPU. With the batch size set to 5, we trained each of the generators (\(G\) and \(F\)) and discriminators (\({D}_{X}\) and \({D}_{Y}\)) for 100 epochs. The weights in each model were initialized from a normal distribution (μ = 0.0, σ = 0.02). As the optimization function, we adopted Adam [48] with learning rate = 0.0002, \({\beta }_{1}=\) 0.5, and \({\beta }_{2}=\) 0.999. The models were trained with \({\lambda }_{{\rm cyc}}=10\), \({\lambda }_{{\rm idt}}=5\), \({\lambda }_{\rm cls}=1\), and \({\lambda }_{\rm gp}=10\).
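These settings can be reproduced as follows (the one-layer generator is a stand-in for the real networks):

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Initialize conv weights from N(0, 0.02), as in the paper."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

gen = nn.Sequential(nn.Conv2d(3, 64, 7, padding=3), nn.ReLU())  # stand-in generator
gen.apply(init_weights)

# Adam with the paper's learning rate and betas
opt = torch.optim.Adam(gen.parameters(), lr=0.0002, betas=(0.5, 0.999))
```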


Visualization of Conditional Translation Results

With the trained models, we visualized the conditional translated images from the testing photograph dataset. The translated artistic images are shown in Fig. 5.

Fig. 5

Results of conditional artistic image translation via domain translation architecture

In the results obtained with our conditional CycleGAN, the model that inserted the conditional vector in the third layer and the one that inserted the vector in each layer of the encoder generated output images that maintained each unique style well. These models retained the color features of the content better than the neural style transfer approach of Gatys et al. [35]. For instance, the translated Ukiyo-e style images were moderately light in color and shape, and the translated Monet style replicated the characteristic brushstrokes well. Comparing these two models, the former tended to reflect the style features more strongly than the latter. The model that inserted the vector in the first layer output almost identical images regardless of the specified style.

In the results obtained with our StarGAN, the models that inserted the vector in the third layer and each of the layers of the encoder could output unique styles, while the model that inserted the vector in the first layer could not reflect the style features in the output images. Different from conditional CycleGAN, the model that inserted the vector in each layer of the encoder output images that more strongly reflected the unique features than did the model that inserted it in the third layer. These models tended to output extreme color tinting, especially when translating the Renoir style.

These results show that the location of the conditional vector affects the translation quality for N-domain translation architectures. The trained models (except those that inserted the vector in the first layer) output conditional images that preserved the style, as was the case for CycleGAN. It is believed that the models that inserted the conditional input in the first layer output poorly translated images, because the features that concatenate content images and conditional inputs become weaker as the inputs go through the bottleneck during training.

More results obtained using our models are shown in the appendices. Appendix A shows the conditional translation results obtained with the testing dataset and Appendix B shows those obtained with the author’s photographs.

Comparison of FID Scores

In the visualization results, the conditional CycleGANs and StarGANs output conditional images through a single generator that translated the content images. To evaluate how close the images generated by these models are to real artwork, we calculated the distance from a given real art domain in terms of the FID score, which captures the similarity between generated (fake) images and target (real) images [49].

Table 1 shows the FID scores of models for each style. A smaller FID score means that the generator could output images that were more similar to those in the real image dataset. In Table 1, bold red (blue) text indicates the smallest (second smallest) score in each style.
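For reference, the FID between two feature distributions (mean and covariance of Inception features) is computed as follows; the Inception-v3 feature-extraction step is omitted here:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet inception distance between two Gaussians [49]:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean))
```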

Table 1 FID scores for various domain translation architectures
Table 2 Classification results for images conditionally generated by VGG-16
Table 3 Classification results for images conditionally generated by VGG-19
Table 4 Classification results for images conditionally generated by ResNet-152

The conditional CycleGAN and StarGAN models with the conditional vector inserted in the first layer had the largest (worst) scores, indicating that they output images very different from the real images. The representative one-to-one domain translation architecture CycleGAN had very good scores for each style. Conditional CycleGAN with the vector inserted in the third layer had the second-best scores for the Monet, Renoir, and Ukiyo-e styles. A comparison of the best and second-best scores for each style indicates that conditional CycleGAN was close to CycleGAN, with a score difference of about 10, while our StarGANs could not achieve comparably good scores.

CycleGAN has a single specified target and can thus focus its training, whereas N-domain translation architectures such as conditional CycleGAN and StarGAN must handle images in N styles; CycleGAN thus outperforms these models.

Style Classification Results

We classified each style of generated artistic image using CNN networks. We fine-tuned VGG-16, VGG-19, and ResNet-152 on the “Painter by Numbers” dataset used for translation learning. In fine-tuning, we split the dataset 7:3 (70% for training and 30% for validation) and confirmed that each model had an accuracy of around 95% on the validation dataset. For classification with the fine-tuned models, we used the 751 testing images of the real-world photograph dataset as the target for style prediction. These images were translated into each art style by controlling the conditional vector and input into the fine-tuned models for style prediction. The classification results are shown in Tables 2, 3, and 4 for VGG-16, VGG-19, and ResNet-152, respectively. Bold text indicates the best accuracy for each art style.

These tables show that the StarGANs (except the models with the conditional vector inserted in the first layer) had the best accuracy, mainly for the Monet, Renoir, and van Gogh styles. Given that their FID scores were relatively poor while their classification accuracy was the best, the StarGANs tended to output conditional images biased toward salient features, such as intensity of color or texture. These images thus had relatively poor FID scores because of reduced color variety, and high accuracy scores because of their intense color features. For each fine-tuned model, the accuracy of the conditional CycleGANs was mostly lower than that of CycleGAN and the StarGANs.


This study developed conditional artistic image translation models based on conditional CycleGAN and StarGAN to translate photographs into artistic images. We compared the quality of the translated images by changing the insertion location of the conditional vector. The visualization of conditional artistic images with trained models indicated that it is possible to translate the entire space of an input image to conditional artistic images with the style controlled by a vector. We found that the insertion location of the conditional vector affects the output image quality.

The visualization results, FID scores, and classification accuracy obtained with fine-tuned models indicate directions for future research. For example, the maximum number of image styles that can be preserved using an N-domain translation model should be determined; the FID scores should be improved to match those of CycleGAN; and the accuracy of art styles should be kept while minimizing FID scores.

We found the best insertion location of the conditional vector, but we still need to improve our conditional CycleGAN and StarGAN, for instance, by replacing instance normalization with AdaIN (adaptive instance normalization), which adjusts the mean and variance of the content image to match those of the style image [50], or with AdaLIN, which helps the generator control the amount of change needed in shape and texture [20].

In constructing the N-to-N domain model, the number of conditional vectors required for conditional translation and the insertion location of the vector were the experimental findings of this study. However, in terms of FID scores, the proposed N-to-N domain models (conditional CycleGAN and StarGAN with vectors inserted at different positions) are still not superior to CycleGAN. Therefore, we would like to construct an N-to-N domain translation model that can learn a wide range of domains and is comparable to CycleGAN and other one-to-one domain models in terms of FID score.


  1. 1.

    Wu X, Sahoo D, Hoi SC. Recent advances in deep learning for object detection. Neurocomputing. 2020;396:39–64.

    Article  Google Scholar 

  2. 2.

    Druzhkov PN, Kustikova VD. A survey of deep learning methods and software tools for image classification and object detection. Pattern Recognit Image Anal. 2016;26(1):9–15.

    Article  Google Scholar 

  3. 3.

    Parkhi OM, Vedaldi A, Zisserman A. Deep face recognition (2015).

  4. 4.

    Hou X, Gong Y, Liu B, Sun K, Liu J, Xu B, Qiu G. Learning based image transformation using convolutional neural networks. IEEE Access. 2018;6:49779–92.

    Article  Google Scholar 

  5. 5.

    Komatsu R, Tad G. Comparing u-net based models for denoising color images. AI, MDPI. 2020;1(4):465–86.

    Google Scholar 

  6. 6.

    Deshpande A, Rock J, Forsyth D. Learning large-scale automatic image colorization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 567–575 (2015).

  7. 7.

    Monkumar A, Sannathamby L, Goyal SB. Unified framework of dense convolution neural network for image super resolution. Mater Today Proc. 2021. (ISSN 2214–7853).

    Article  Google Scholar 

  8. 8.

    Cao S, An G, Zheng Z, Ruan Q. Interactions guided generative adversarial network for unsupervised image captioning. Neurocomputing. 2020;417:419–31. (ISSN 0925–2312).

    Article  Google Scholar 

  9. 9.

    Sagar, Vishwakarma DK. A state-of-the-arts and prospective in neural style transfer. In: 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 244–247 (2019).

  10. 10.

    Liu Q, Zhang F, Lin M, Wang Y. Portrait style transfer with generative adversarial networks. In: Liu Q, Liu X, Li L, Zhou H, Zhao HH, editors. Proceedings of the 9th international conference on computer engineering and networks. Advances in intelligent systems and computing, vol. 1143. Singapore: Springer; 2021.

    Chapter  Google Scholar 

  11. 11.

    Li S, Songzhi S, Lin J, Cai G, Sun L. Deep 3D caricature face generation with identity and structure consistency. Neurocomputing. 2021;454:178–88. (ISSN 0925–2312).

    Article  Google Scholar 

  12. 12.

    Li B, Zhu Y, Wang Y, Lin CW, Ghanem B, Shen L. AniGAN: style-guided generative adversarial networks for unsupervised anime face generation (2021). arXiv preprint arXiv:2102.12593.

  13. 13.

    Souly N, Spampinato C, Shah M. Semi supervised semantic segmentation using generative adversarial network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5688–5696 (2017).

  14. 14.

    Isola P, Zhu J, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134 (2017).

  15. Choi J, Kim T, Kim C. Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, pp. 6830–6840 (2019).

  16. Cheng Z, Yang Q, Sheng B. Deep colorization. In: Proceedings of the IEEE international conference on computer vision, pp. 415–423 (2015).

  17. Iizuka S, Simo-Serra E, Ishikawa H. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans Graph (TOG). 2016;35(4):1–11.

  18. Chen Y, Lai Y, Liu Y. CartoonGAN: generative adversarial networks for photo cartoonization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9465–9474 (2018).

  19. Chen J, Liu G, Chen X. AnimeGAN: a novel lightweight GAN for photo animation. In: International Symposium on Intelligence Computation and Applications. Singapore: Springer; 2019. p. 242–56.

  20. Kim J, Kim M, Kang H, Lee K. U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation (2019). arXiv preprint arXiv:1907.10830.

  21. Zhu J, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 2223–2232 (2017).

  22. Mirza M, Osindero S. Conditional generative adversarial nets (2014). arXiv preprint arXiv:1411.1784.

  23. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas D. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 5907–5915 (2017).

  24. Li X, Zhang Y, Zhang J, Chen Y, Li H, Marsic I, Burd RS. Region-based activity recognition using conditional GAN. In: Proceedings of the 25th ACM international conference on Multimedia, pp. 1059–1067 (2017).

  25. Nguyen V, Vicente TFY, Zhao M, Hoai M, Samaras D. Shadow detection with conditional generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 4510–4518 (2017).

  26. Choi Y, Choi M, Kim M, Ha J, Kim S, Choo J. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797 (2018).

  27. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556.

  28. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016).

  29. LeCun Y, Haffner P, Bottou L, Bengio Y. Object recognition with gradient-based learning. In: Shape, contour and grouping in computer vision. Berlin, Heidelberg: Springer; 1999. p. 319–45.

  30. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inform Process Syst. 2012;25:1097–105.

  31. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inform Process Syst. 2015;28:91–9.

  32. Ying L, Dinghua S, Fuping W, Pang LK, Kiang CT, Yi L. Learning wavelet coefficients for face super-resolution. Vis Comput. 2020;37:1–10.

  33. Zhang J, Wang C, Li C, Qin H. Example-based rapid generation of vegetation on terrain via CNN-based distribution learning. Vis Comput. 2019;35:1181–91.

  34. Komatsu R, Gonsalves T. Conditional DCGAN's challenge: generating handwritten character digit, alphabet and katakana. In: Proceedings of the 33rd Annual Conference of the Japanese Society for Artificial Intelligence, pp. 3B3E204–3B3E204 (2019).

  35. Gatys LA, Ecker AS, Bethge M. A neural algorithm of artistic style (2015). arXiv preprint arXiv:1508.06576.

  36. Gatys LA, Ecker AS, Bethge M. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423 (2016).

  37. Wang L, Wang Z, Yang X, Hu SM, Zhang J. Photographic style transfer. Vis Comput. 2020;36:317–31.

  38. Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. Cham: Springer; 2016. p. 694–711.

  39. Yanai K. Unseen style transfer based on a conditional fast style transfer network. In: Workshop of the International Conference on Learning Representations (2017).

  40. Liu M, Breuel T, Kautz J. Unsupervised image-to-image translation networks. Adv Neural Inform Process Syst. 2017;30:700–8.

  41. Chen R, Huang W, Huang B, Sun F, Fang B. Reusing discriminators for encoding: towards unsupervised image-to-image translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8168–8177 (2020).

  42. Lu Y, Tai Y, Tang C. Attribute-guided face generation using conditional CycleGAN. In: Proceedings of the European conference on computer vision (ECCV), pp. 282–297 (2018).

  43. Horita D, Tanno R, Shimoda W, Yanai K. Food category transfer with conditional CycleGAN and a large-scale food image dataset. In: Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, pp. 67–70 (2018).

  44. Nie W, Narodytska N, Patel AB. RelGAN: relational generative adversarial networks for text generation. In: International conference on learning representations (2018).

  45. Taigman Y, Polyak A, Wolf L. Unsupervised cross-domain image generation (2016). arXiv preprint arXiv:1611.02200.

  46. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, vol. 70, pp. 214–223 (2017).

  47. Duck SK, Nichol K. Painter by Numbers (2016). Accessed 28 Aug 2020.

  48. Kingma DP, Ba JL. Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980.

  49. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv Neural Inform Process Syst. 2017;30:6626–37.

  50. Huang X, Belongie S. Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision, pp. 1501–1510 (2017).



Funding

This study has received no funding.

Author information



Corresponding author

Correspondence to Tad Gonsalves.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Visualized Results with Testing Images


Visualized Results with Author’s Photos



About this article


Cite this article

Komatsu, R., Gonsalves, T. Translation of Real-World Photographs into Artistic Images via Conditional CycleGAN and StarGAN. SN COMPUT. SCI. 2, 489 (2021).



Keywords

  • Deep learning
  • Conditional style transfer
  • Generative adversarial network (GAN)
  • Multi-domain translation