Dual Generator Generative Adversarial Networks for Multi-domain Image-to-Image Translation
Abstract
State-of-the-art methods for image-to-image translation with Generative Adversarial Networks (GANs) can learn a mapping from one domain to another using unpaired image data. However, these methods require training one specific model for every pair of image domains, which limits their scalability when dealing with more than two domains. In addition, the training stage of these methods commonly suffers from mode collapse, which degrades the quality of the generated images. To tackle these issues, we propose a Dual Generator Generative Adversarial Network (G\(^2\)GAN), a robust and scalable approach that performs unpaired image-to-image translation for multiple domains using only dual generators within a single model. Moreover, we explore different optimization losses for better training of G\(^2\)GAN, thereby obtaining unpaired image-to-image translation with higher consistency and better stability. Extensive experiments on six publicly available datasets covering different scenarios, i.e., architectural buildings, seasons, landscapes and human faces, demonstrate that the proposed G\(^2\)GAN achieves superior model capacity and better generation performance compared with existing image-to-image translation GAN models.
Keywords
Generative Adversarial Network · Image-to-image translation · Unpaired data · Multi-domain
1 Introduction
To overcome the aforementioned limitation, Choi et al. propose StarGAN [5] (Fig. 1(b)), which can perform multi-domain image-to-image translation using only one generator/discriminator pair with the aid of an auxiliary classifier [25]. More formally, let X and Y represent training sets of two different domains, and \(x\in X\) and \(y\in Y\) denote training images in domain X and domain Y, respectively; let \(z_y\) and \(z_x\) indicate the category labels of domain Y and X, respectively. StarGAN utilizes the same generator G twice, first translating from X to Y with the label \(z_y\), i.e., \(G(x, z_y)\approx y\), and then reconstructing the input x from the translated output \(G(x, z_y)\) and the label \(z_x\), i.e., \(G(G(x, z_y), z_x)\approx x\). In doing so, the same generator shares a common mapping and data structures for two different tasks, i.e., translation and reconstruction. However, since each task has unique information and distinct targets, it is harder to optimize the generator and to make it generalize well on both tasks, which usually leads to blurred generation results.
In this paper, we propose a novel Dual Generator Generative Adversarial Network (G\(^2\)GAN) (Fig. 1(c)). Unlike StarGAN, G\(^2\)GAN consists of two generators and one discriminator: the translation generator \(G^t\) transforms images from X to Y, and the reconstruction generator \(G^r\) uses the images generated by \(G^t\) and the original domain label \(z_x\) to reconstruct the original x. The generators \(G^t\) and \(G^r\) cope with different tasks and receive different input data distributions: \(G^t\) takes a real image and a target domain label as input and aims to generate an image in the target domain, whereas \(G^r\) takes a generated image and the original domain label as input and aims to recover the original image. Since the input images of \(G^t\) and \(G^r\) are a real image and a generated image, respectively, it is intuitive to design different network structures for the two generators. The two generators are allowed to use different network designs and different levels of parameter sharing according to the diverse difficulty of the tasks. In this way, each generator can have its own network parts, which usually helps to better learn each task-specific mapping in a multi-task setting [31].

We propose a novel Dual Generator Generative Adversarial Network (G\(^2\)GAN), which can perform unpaired image-to-image translation among multiple image domains. The dual generators, allowing different network structures and different levels of parameter sharing, are designed to specifically cope with the translation and the reconstruction tasks, which facilitates a better generalization ability of the model and improves the generation quality.

We explore jointly utilizing different objectives for a better optimization of the proposed G\(^2\)GAN, thus obtaining unpaired multi-modality translation with higher consistency and better stability.

We extensively evaluate G\(^2\)GAN on six different datasets covering different scenarios, such as architectural buildings, seasons, landscapes and human faces, demonstrating its superior model capacity and better generation performance compared with state-of-the-art methods on the multi-domain image-to-image translation task.
2 Related Work
Generative Adversarial Networks (GANs) [6] are powerful generative models, which have achieved impressive results on different computer vision tasks, e.g., image generation [26, 35], editing [4, 36] and inpainting [15, 46]. However, GANs are difficult to train, since it is hard to keep the balance between the generator and the discriminator; the optimization can oscillate and lead to a collapse of the generator. To address this, several solutions have been proposed recently, such as Wasserstein GAN [2] and Loss-Sensitive GAN [28]. To generate more meaningful images, CGAN [23] employs conditioning information to guide the image generation. Various kinds of extra information can be used, such as discrete category labels [16, 27], text descriptions [20, 29], object/face keypoints [30, 41], human skeletons [37, 39] and reference images [8, 14]. CGAN models have been successfully applied to various tasks, such as image editing [27], text-to-image translation [20] and image-to-image translation [8].
Image-to-Image Translation. CGAN models learn a translation between image inputs and image outputs using neural networks. Isola et al. [8] design the pix2pix framework, which uses a CGAN to learn the mapping function between paired images. Based on pix2pix, Zhu et al. [50] further present BicycleGAN, which achieves multimodal image-to-image translation using paired data. Similar ideas have also been applied to many other tasks, e.g., generating photographs from sketches [33]. However, most of these models require paired training data, which are usually costly to obtain.
Unpaired Image-to-Image Translation. To alleviate the issue of pairing training data, Zhu et al. [49] introduce CycleGAN, which learns the mappings between two unpaired image domains without supervision with the aid of a cycle-consistency loss. Apart from CycleGAN, other variants have been proposed to tackle the problem. For instance, CoupledGAN [18] uses a weight-sharing strategy to learn common representations across domains. Taigman et al. [38] propose a Domain Transfer Network (DTN), which learns a generative function between one domain and another. Liu et al. [17] extend the basic structure of GANs by combining Variational Autoencoders (VAEs) and GANs. A novel DualGAN mechanism is demonstrated in [47], in which image translators are trained on two unlabeled image sets, each representing an image domain. Kim et al. [10] propose a GAN-based method that learns to discover relations between different domains. However, these models only handle cross-domain translation between two domains at a time.
Multi-domain Unpaired Image-to-Image Translation. Only a few recent methods attempt to implement multi-domain image-to-image translation in an efficient way. Anoosheh et al. propose ComboGAN [1], which only needs to train m generator/discriminator pairs for m different image domains. To further reduce the model complexity, Choi et al. introduce StarGAN [5], which has a single generator/discriminator pair and is able to perform the task with a complexity of \(\mathrm {\Theta }(1)\). Although the model complexity is low, jointly learning both the translation and reconstruction tasks with the same generator requires sharing all parameters, which increases the optimization complexity and reduces the generalization ability, thus leading to unsatisfactory generation performance. The proposed approach aims at obtaining a good balance between model capacity and generation quality. Along this research line, we propose a Dual Generator Generative Adversarial Network (G\(^2\)GAN), which achieves this target by using two task-specific generators and one discriminator. We also explore various optimization objectives to better train the model and produce more consistent and more stable results.
3 G\(^2\)GAN: Dual Generator Generative Adversarial Networks
3.1 Model Formulation
In this work, we focus on the multi-domain image-to-image translation task with unpaired training data. An overview of the proposed G\(^2\)GAN is depicted in Fig. 2. The model is specifically designed for the multi-domain translation problem and has significant advantages in model complexity and training overhead compared with cross-domain generation models such as CycleGAN [49], DiscoGAN [10] and DualGAN [47], which need to separately train \(C_m^2=\frac{m(m-1)}{2}\) models for m different image domains, while ours only needs to train a single model. We also directly compare with StarGAN [5], which simply employs the same generator for the distinct reconstruction and translation tasks. Since training a single-generator model for multiple domains is a challenging problem (refer to Sect. 4), we propose a more effective dual-generator network structure and more robust optimization objectives to stabilize the training process. Our work focuses on exploring different strategies to improve the optimization of the multi-domain model, aiming to give useful insights into the design of more effective multi-domain generators.
Our goal is to learn all the mappings among multiple domains using dual generators and one discriminator. To achieve this, we train a translation generator \(G^t\) to convert an input image x into an output image y conditioned on the target domain label \(z_y\), i.e., \(G^t(x, z_y){\rightarrow }y\). The reconstruction generator \(G^r\) then accepts the generated image \(G^t(x, z_y)\) and the original domain label \(z_x\) as input, and learns to reconstruct the input image x, i.e., \(G^r(G^t(x, z_y), z_x){\rightarrow }x\), through the proposed optimization losses, including the color cycle-consistency loss, which addresses the “channel pollution” issue, and the MS-SSIM loss, which preserves luminance, contrast and structure information across scales. The dual generators are task-specific generators, which allows for different network designs and different levels of parameter sharing to better learn each generator. The discriminator D tries to distinguish between the real image y and the generated image \(G^t(x, z_y)\), and to classify the generated image \(G^t(x, z_y)\) into the target domain label \(z_y\) via the domain classification loss. We further investigate how distinct network designs and different network sharing schemes for the dual generators, dealing with different sub-tasks, can balance the generation performance and the network complexity. The multi-domain model StarGAN [5] did not consider these aspects.
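To make the data flow concrete, the following is a minimal PyTorch sketch of the forward passes in one G\(^2\)GAN training step, assuming hypothetical modules G_t, G_r and D with the interfaces described above (the actual architectures are given in Sect. 3.3):

```python
def forward_cycle(G_t, G_r, D, x, z_x, z_y):
    """x: batch of input images; z_x / z_y: source / target domain labels.

    G_t, G_r and D are placeholders for the translation generator, the
    reconstruction generator and the discriminator, respectively.
    """
    fake_y = G_t(x, z_y)        # translation:    G^t(x, z_y) -> y
    rec_x = G_r(fake_y, z_x)    # reconstruction: G^r(G^t(x, z_y), z_x) -> x
    # D returns a real/fake score (a patch-wise map, see Sect. 3.3) and
    # domain classification logits for the generated image.
    src_score, cls_logits = D(fake_y)
    return fake_y, rec_x, src_score, cls_logits
```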
3.2 Model Optimization
The optimization objective of the proposed G\(^2\)GAN contains five different losses: the color cycle-consistency loss, the multi-scale SSIM loss, the conditional least square loss, the domain classification loss and the conditional identity preserving loss. These losses are jointly embedded into the model during training. We present the details of these loss functions in the following.
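As an illustration, the sketch below combines the five terms into a single generator objective. It is only a schematic reconstruction: the exact formulation of each loss and the pairing of the weights \(\lambda _1,\dots ,\lambda _4\) of Eq. 9 with the individual terms are assumptions here, as are the per-channel cycle term, the identity input idt_y (assumed to be \(G^t(y, z_y)\)) and the use of the third-party pytorch-msssim package.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # third-party MS-SSIM implementation

def generator_objective(D, x, rec_x, fake_y, y, idt_y, z_y_idx,
                        lambdas=(1.0, 10.0, 1.0, 0.5)):
    l1, l2, l3, l4 = lambdas
    src_score, cls_logits = D(fake_y)
    # Conditional least square (LSGAN-style) loss: push fake scores toward 1.
    loss_gan = torch.mean((src_score - 1.0) ** 2)
    # Domain classification loss: fakes should be classified as domain z_y.
    loss_cls = F.cross_entropy(cls_logits, z_y_idx)
    # Color cycle-consistency loss, sketched as channel-wise L1 terms
    # (one per color channel, to avoid "polluting" the other channels).
    loss_cyc = sum(F.l1_loss(rec_x[:, c], x[:, c]) for c in range(x.size(1)))
    # Multi-scale SSIM loss: preserves luminance, contrast and structure.
    loss_ssim = 1.0 - ms_ssim(rec_x, x, data_range=1.0)
    # Conditional identity preserving loss: a real target-domain image y fed
    # with its own label should stay close to itself.
    loss_idt = F.l1_loss(idt_y, y)
    # Assumed weight assignment; the true grouping is given by Eq. 9.
    return loss_gan + l1 * loss_cls + l2 * loss_cyc + l3 * loss_ssim + l4 * loss_idt
```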
3.3 Implementation Details
G\(^2\)GAN Architecture. The network consists of dual generators and one discriminator. The dual generators are designed to specifically deal with the different tasks in the model, i.e., the translation and the reconstruction tasks, which have different training targets. We can design different network structures for the two generators so that each learns its task-specific objective better. This also allows us to share parameters between the generators to further reduce the model capacity, since the shallow image representations are sharable between both generators. Parameter sharing facilitates a good balance between model complexity and generation quality. Our model generalizes StarGAN [5]: when the parameters are fully shared and the same network structure is used for both generators, our basic structure reduces to StarGAN. For the discriminator, we employ PatchGAN [5, 8, 13, 49]. After the discriminator, a convolution layer is applied to produce a final one-dimensional output indicating whether local image patches are real or fake.
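A minimal sketch of the three sharing schemes evaluated later (no, partial and full sharing), assuming a simple encoder-decoder generator and StarGAN-style label conditioning; make_encoder and make_decoder are toy stand-ins, not the paper's actual layers:

```python
import torch
import torch.nn as nn

NUM_DOMAINS = 7  # e.g. the seven Bu3dfe expression domains

# Toy stand-ins for the real (much deeper) encoder/decoder of the paper.
def make_encoder(in_ch=3 + NUM_DOMAINS, dim=64):
    return nn.Sequential(nn.Conv2d(in_ch, dim, 7, padding=3), nn.ReLU(inplace=True))

def make_decoder(dim=64, out_ch=3):
    return nn.Sequential(nn.Conv2d(dim, out_ch, 7, padding=3), nn.Tanh())

class CondGenerator(nn.Module):
    """Encoder-decoder generator conditioned on a spatially tiled domain label."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, img, label_map):
        # Concatenate the tiled one-hot domain label as extra input channels
        # (StarGAN-style conditioning; an assumption in this sketch).
        return self.decoder(self.encoder(torch.cat([img, label_map], dim=1)))

def build_dual_generators(sharing="partial"):
    enc, dec = make_encoder(), make_decoder()
    if sharing == "full":
        g = CondGenerator(enc, dec)  # all parameters shared -> StarGAN-like
        return g, g
    if sharing == "partial":
        # Share the shallow representation (encoder); task-specific decoders.
        return CondGenerator(enc, dec), CondGenerator(enc, make_decoder())
    # No sharing: fully task-specific translation/reconstruction generators.
    return CondGenerator(enc, dec), CondGenerator(make_encoder(), make_decoder())
```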
Table 1. Description of the datasets used in our experiments.
Dataset  Type  # Domains  # Translations  Resolution  Unpaired/Paired  # Training  # Testing  # Total
Facades [40]  Architectures  2  2  256 \(\times \) 256  Paired  800  212  1,012 
AR [22]  Faces  4  12  768 \(\times \) 576  Paired  920  100  1,020 
Bu3dfe [48]  Faces  7  42  512 \(\times \) 512  Paired  2,520  280  2,800 
Alps [1]  Natural seasons  4  12  –  Unpaired  6,053  400  6,453 
RaFD [12]  Faces  8  56  1024 \(\times \) 681  Unpaired  5,360  2,680  8,040 
Collection [49]  Painting style  5  20  256 \(\times \) 256  Unpaired  7,837  1,593  9,430 
4 Experiments
In this section, we first introduce the experimental setup, and then show detailed qualitative and quantitative results and model analysis.
4.1 Experimental Setup
Datasets. We employ six publicly available datasets to validate our G\(^2\)GAN: Facades, AR Face, Alps Season, Bu3dfe, RaFD and the Collection style dataset. A detailed comparison of these datasets is shown in Table 1.
Parameter Setting. The initial learning rate for the Adam optimizer [11] is 0.0002, with \(\beta _1\) and \(\beta _2\) set to 0.5 and 0.999. The parameters \(\lambda _1, \lambda _2, \lambda _3, \lambda _4\) in Eq. 9 are set to 1, 10, 1 and 0.5, respectively. The constants \(C_1\) and \(C_2\) in Eq. 4 are set to \(0.01^2\) and \(0.03^2\). The proposed G\(^2\)GAN is implemented in the PyTorch deep learning framework. Experiments are conducted on an NVIDIA TITAN Xp GPU.
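In PyTorch, this setup corresponds roughly to the following (the networks are placeholders here; only the hyper-parameter values come from the text above):

```python
import torch
import torch.nn as nn

# Placeholder networks; the real architectures are described in Sect. 3.3.
G_t, G_r = nn.Conv2d(3, 3, 3, padding=1), nn.Conv2d(3, 3, 3, padding=1)
D = nn.Conv2d(3, 1, 3, padding=1)

g_params = list(G_t.parameters()) + list(G_r.parameters())
opt_G = torch.optim.Adam(g_params, lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

LAMBDAS = (1.0, 10.0, 1.0, 0.5)  # lambda_1..lambda_4 of Eq. 9
C1, C2 = 0.01 ** 2, 0.03 ** 2    # SSIM stabilization constants of Eq. 4
```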
Baseline Models. We consider several state-of-the-art cross-domain image generation models as our baselines, i.e., CycleGAN [49], DistanceGAN [3], Dist. + Cycle [3], Self Dist. [3], DualGAN [47], ComboGAN [1], BicycleGAN [50] and pix2pix [8]. For comparison, we train these models multiple times, once for every pair of image domains, except for ComboGAN [1], which needs m models for m different domains. We also employ StarGAN [5] as a baseline, which can perform multi-domain image translation using one generator/discriminator pair. Note that the fully supervised pix2pix and BicycleGAN are trained on paired data, while the other baselines and G\(^2\)GAN are trained with unpaired data. Since BicycleGAN can generate several different outputs for a single input image, we randomly select one of them for comparison. For a fair comparison, we re-implement the baselines using the same training strategy as our approach.
4.2 Comparison with the StateoftheArt on Different Tasks
We evaluate the proposed G\(^2\)GAN on four different tasks, i.e., label\(\leftrightarrow \)photo translation, facial expression synthesis, season translation and painting style transfer. The comparisons with the state of the art are described in the following.
Task 2: Facial Expression Synthesis. We adopt three face datasets (i.e., AR, Bu3dfe and RaFD) for the facial expression synthesis task, with settings similar to StarGAN. Note that for the AR dataset, we not only show translations from the neutral expression to non-neutral expressions as in [5], but also the opposite mappings, i.e., from non-neutral expressions to the neutral expression. For the Bu3dfe dataset, due to space limitations we only show translations from the neutral expression to non-neutral expressions as in [5]. As can be seen in Fig. 4, Dist. + Cycle and Self Dist. fail to produce faces resembling the target domain. DualGAN generates reasonable but blurry faces. DistanceGAN, StarGAN, pix2pix and BicycleGAN produce much sharper results, but their faces still contain artifacts, e.g., the twisted mouths of StarGAN, pix2pix and BicycleGAN on the “neutral2fear” task. CycleGAN, ComboGAN and G\(^2\)GAN work better than the other baselines on this dataset. We observe similar results on the Bu3dfe dataset, as shown in Fig. 5 (Left). Finally, we present results on the RaFD dataset in Fig. 5 (Right), where our method achieves visually better results than CycleGAN and StarGAN.
Table 2. Results on RaFD.
Table 4. AMT “real vs fake” study on the Facades, AR, Alps and Bu3dfe datasets. Numbers indicate the percentage of generated images labeled “real” by AMT Turkers.
Model  label\(\rightarrow \)photo  photo\(\rightarrow \)label  AR  Alps  Bu3dfe
CycleGAN [49] (ICCV 2017)  8.8% ± 1.5%  4.8% ± 0.8%  24.3% ± 1.7%  39.6% ± 1.4%  16.9% ± 1.2% 
DualGAN [47] (ICCV 2017)  0.6% ± 0.2%  0.8% ± 0.3%  1.9% ± 0.6%  18.2% ± 1.8%  3.2% ± 0.4% 
ComboGAN [1] (CVPR 2018)  4.1% ± 0.5%  0.2% ± 0.1%  4.7% ± 0.9%  34.3% ± 2.2%  25.3% ± 1.6% 
DistanceGAN [3] (NIPS 2017)  5.7% ± 1.1%  1.2% ± 0.5%  2.7% ± 0.7%  4.4% ± 0.3%  6.5% ± 0.7% 
Dist. + Cycle [3] (NIPS 2017)  0.3% ± 0.2%  0.2% ± 0.1%  1.3% ± 0.5%  3.8% ± 0.6%  0.3% ± 0.1% 
Self Dist. [3] (NIPS 2017)  0.3% ± 0.1%  0.1% ± 0.1%  0.1% ± 0.1%  5.7% ± 0.5%  1.1% ± 0.3% 
StarGAN [5] (CVPR 2018)  3.5% ± 0.7%  1.3% ± 0.3%  4.1% ± 1.3%  8.6% ± 0.7%  9.3% ± 0.9% 
pix2pix [8] (CVPR 2017)  4.6% ± 0.5%  1.5% ± 0.4%  2.8% ± 0.6%  –  3.6% ± 0.5% 
BicycleGAN [50] (NIPS 2017)  5.4% ± 0.6%  1.1% ± 0.3%  2.1% ± 0.5%  –  2.7% ± 0.4% 
G\(^2\)GAN (Ours, fully-sharing)  4.6% ± 0.9%  2.4% ± 0.4%  6.8% ± 0.6%  15.4% ± 1.9%  13.1% ± 1.3%
G\(^2\)GAN (Ours, partially-sharing)  8.2% ± 1.2%  3.6% ± 0.7%  16.8% ± 1.2%  36.7% ± 2.3%  18.9% ± 1.1%
G\(^2\)GAN (Ours, no-sharing)  10.3% ± 1.6%  5.6% ± 0.9%  22.8% ± 1.9%  47.7% ± 2.8%  23.6% ± 1.7%
Quantitative Comparison on All Tasks. We also provide quantitative results on the four tasks, considering different metrics: (i) AMT perceptual studies [8, 49], (ii) Inception Score (IS) [32], (iii) Fréchet Inception Distance (FID) [7] and (iv) Classification Accuracy (CA) [5]. We follow the same perceptual study protocol as CycleGAN and StarGAN. Tables 2, 3 and 4 report the performance of the AMT perceptual test, a “real vs fake” perceptual metric assessing realism from a holistic aspect. For the Facades dataset, we split the evaluation into two sub-tasks as in [49], label\(\rightarrow \)photo and photo\(\rightarrow \)label. For the other datasets, we report the average performance over all mappings. As shown in Tables 2, 3 and 4, the proposed G\(^2\)GAN achieves very competitive results compared with the other baselines, and significantly outperforms StarGAN, which is trained with a single generator, on most of the metrics and on all the datasets. Note that the paired pix2pix shows worse results than unpaired methods in Table 4, which was also observed in DualGAN [47].
We also use the Inception Score (IS) [32] to measure the quality of the generated images; Tables 2 and 5 report the results. As discussed above, the proposed G\(^2\)GAN generates sharper, more photo-realistic and more reasonable results than Dist. + Cycle, Self Dist. and StarGAN, even though the latter models present slightly higher IS. However, a higher IS does not necessarily mean higher image quality: high-quality images may have a small IS, as demonstrated in other image generation [19] and super-resolution works [9, 34]. Moreover, we employ FID [7] to measure the performance on the RaFD and painting style datasets. The results, shown in Tables 2 and 3, indicate that G\(^2\)GAN achieves the best FID compared with StarGAN and CycleGAN.
Finally, we compute the Classification Accuracy (CA) on the synthesized images as in [5]. We train classifiers on the AR, Alps, Bu3dfe and Collection datasets, respectively. For each dataset, we take the real images as training data and the generated images of the different models as testing data. The intuition behind this setting is that if the generated images are realistic and follow the distribution of the target domain, a classifier trained on real images will classify the generated images correctly. For the AR, Alps and Collection datasets we list top-1 accuracy, while for Bu3dfe we report top-1 and top-5 accuracy. Tables 3 and 5 show the results. G\(^2\)GAN outperforms the baselines on the AR, Bu3dfe and Collection datasets. On the Alps dataset, StarGAN achieves slightly better accuracy, but the images generated by our model contain fewer artifacts, as shown in Fig. 6.
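A hedged sketch of this evaluation protocol (the classifier is assumed to be any standard image classifier trained on the real images of the dataset):

```python
import torch

@torch.no_grad()
def classification_accuracy(classifier, generated_loader, topk=(1, 5)):
    """Evaluate a classifier trained on real images on generated images.

    generated_loader yields (fake_image, target_domain_label) batches; high
    accuracy suggests the fakes follow the target-domain distribution.
    """
    correct = {k: 0 for k in topk}
    total = 0
    for fake, label in generated_loader:
        logits = classifier(fake)
        _, pred = logits.topk(max(topk), dim=1)  # top-k class indices
        for k in topk:
            hits = (pred[:, :k] == label.unsqueeze(1)).any(dim=1)
            correct[k] += hits.sum().item()
        total += label.size(0)
    return {f"top-{k}": correct[k] / total for k in topk}
```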
4.3 Model Analysis
Table 5. Results of Inception Score (IS) and Classification Accuracy (CA).
Model  Facades IS  AR IS  AR CA  Alps IS  Alps CA  Bu3dfe IS  Bu3dfe CA
CycleGAN [49] (ICCV 2017)  3.6098  2.8321  @1:27.333%  4.1734  @1:42.250%  1.8173  @1:48.292%, @5:94.167% 
DualGAN [47] (ICCV 2017)  3.7495  1.9148  @1:28.667%  4.2661  @1:53.488%  1.7176  @1:40.000%, @5:90.833% 
ComboGAN [1] (CVPR 2018)  3.1289  2.4750  @1:28.250%  4.2438  @1:62.750%  1.7887  @1:40.459%, @5:90.714% 
DistanceGAN [3] (NIPS 2017)  3.9988  2.3455  @1:26.000%  4.8047  @1:31.083%  1.8974  @1:46.458%, @5:90.000% 
Dist. + Cycle [3] (NIPS 2017)  2.6897  3.5554  @1:14.667%  5.9531  @1:29.000%  3.4618  @1:26.042%, @5:79.167% 
Self Dist. [3] (NIPS 2017)  3.8155  2.1350  @1:21.333%  5.0584  @1:34.917%  3.4620  @1:10.625%, @5:74.167% 
StarGAN [5] (CVPR 2018)  4.3182  2.0290  @1:26.250%  3.3670  @1:65.375%  1.5640  @1:52.704%, @5:94.898% 
pix2pix [8] (CVPR 2017)  3.6664  2.2849  @1:22.667%  –  –  1.4575  @1:44.667%, @5:91.750%
BicycleGAN [50] (NIPS 2017)  3.2217  2.0859  @1:28.000%  –  –  1.7373  @1:45.125%, @5:93.125%
G\(^2\)GAN (Ours, fully-sharing)  4.2615  2.3875  @1:28.396%  3.6597  @1:61.125%  1.9728  @1:52.985%, @5:95.165%
G\(^2\)GAN (Ours, partially-sharing)  4.1689  2.4846  @1:28.835%  4.0158  @1:62.325%  1.5896  @1:53.456%, @5:95.846%
G\(^2\)GAN (Ours, no-sharing)  4.0819  2.6522  @1:29.667%  4.3773  @1:63.667%  1.8714  @1:55.625%, @5:96.250%
Table 6. Evaluation of different variants of G\(^2\)GAN on the Facades, AR and Bu3dfe datasets. All: full version of G\(^2\)GAN; I: identity preserving loss; S: multi-scale SSIM loss; C: color cycle-consistency loss; D: double discriminators strategy. “All − X” denotes the full model with component X removed.
Model  label\(\rightarrow \)photo (% real)  photo\(\rightarrow \)label (% real)  AR (% real)  AR CA  Bu3dfe (% real)  Bu3dfe CA
All  10.3% ± 1.6%  5.6% ± 0.9%  22.8% ± 1.9%  @1:29.667%  23.6% ± 1.7%  @1:55.625%, @5:96.250% 
All − I  2.6% ± 0.4%  4.2% ± 1.1%  4.7% ± 0.8%  @1:29.333%  16.3% ± 1.1%  @1:53.739%, @5:95.625%
All − S − C  4.4% ± 0.6%  4.8% ± 1.3%  8.7% ± 0.6%  @1:28.000%  14.4% ± 1.2%  @1:42.500%, @5:95.417%
All − S − C − I  2.2% ± 0.3%  3.9% ± 0.8%  2.1% ± 0.4%  @1:24.667%  13.6% ± 1.2%  @1:41.458%, @5:95.208%
All − D  9.0% ± 1.5%  5.3% ± 1.1%  21.7% ± 1.7%  @1:28.367%  22.3% ± 1.6%  @1:53.375%, @5:95.292%
All − D − S  3.3% ± 0.7%  4.5% ± 1.1%  14.7% ± 1.7%  @1:27.333%  20.1% ± 1.4%  @1:42.917%, @5:91.250%
All − D − C  8.7% ± 1.3%  5.1% ± 0.9%  19.4% ± 1.5%  @1:28.000%  21.6% ± 1.4%  @1:45.833%, @5:93.875%
Table 7. Comparison of the overall model capacity of different models.
Method  # Models  # Parameters with \(m=7\)
pix2pix [8] (CVPR 2017)  \(A_m^2=m(m-1)\)  57.2M \(\times \) 42
BicycleGAN [50] (NIPS 2017)  \(A_m^2=m(m-1)\)  64.3M \(\times \) 42
CycleGAN [49] (ICCV 2017)  \(C_m^2=\frac{m(m-1)}{2}\)  52.6M \(\times \) 21
DiscoGAN [10] (ICML 2017)  \(C_m^2=\frac{m(m-1)}{2}\)  16.6M \(\times \) 21
DualGAN [47] (ICCV 2017)  \(C_m^2=\frac{m(m-1)}{2}\)  178.7M \(\times \) 21
DistanceGAN [3] (NIPS 2017)  \(C_m^2=\frac{m(m-1)}{2}\)  52.6M \(\times \) 21
ComboGAN [1] (CVPR 2018)  m  14.4M \(\times \) 7 
StarGAN [5] (CVPR 2018)  1  53.2M \(\times \) 1 
G\(^2\)GAN (Ours, fully-sharing)  1  53.2M \(\times \) 1
G\(^2\)GAN (Ours, partially-sharing)  1  53.8M \(\times \) 1
G\(^2\)GAN (Ours, no-sharing)  1  61.6M \(\times \) 1
Overall Model Capacity Analysis. We compare the overall model capacity with the other baselines. The number of models and the number of model parameters for m image domains on the Bu3dfe dataset are shown in Table 7. BicycleGAN and pix2pix are supervised models, so they need to train \(A_m^2\) models for m image domains. CycleGAN, DiscoGAN, DualGAN and DistanceGAN are unsupervised methods requiring \(C_m^2\) models for m image domains, but each of them contains two generators and two discriminators. ComboGAN requires m models to learn all the mappings of m domains, while StarGAN and G\(^2\)GAN only need to train a single model. We also report the number of parameters on the Bu3dfe dataset, which contains 7 different expressions, i.e., \(m=7\). Note that DualGAN uses fully connected layers in its generators, which brings a significantly larger number of parameters; CycleGAN and DistanceGAN have the same architecture and thus the same number of parameters. Moreover, G\(^2\)GAN uses fewer parameters than all baselines except StarGAN, while achieving significantly better generation performance on most metrics, as shown in Tables 2, 3, 4 and 5. When we employ the parameter sharing scheme, our performance drops only slightly (still outperforming StarGAN) while the number of parameters becomes comparable with StarGAN.
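The model counts in Table 7 follow directly from these formulas; for instance, for the seven Bu3dfe expression domains:

```python
from math import comb, perm

m = 7              # number of expression domains in Bu3dfe
print(perm(m, 2))  # A_m^2 = m(m-1) = 42 models for paired methods
print(comb(m, 2))  # C_m^2 = m(m-1)/2 = 21 models for unpaired pair methods
# ComboGAN needs m = 7 models; StarGAN and G^2GAN need a single model.
```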
5 Conclusion
We have proposed the Dual Generator Generative Adversarial Network (G\(^2\)GAN), a robust and scalable generative model that performs unpaired image-to-image translation for multiple domains using only dual generators within a single model. The dual generators, allowing for different network structures and different levels of parameter sharing, are designed for the translation and the reconstruction tasks. Moreover, we explore jointly using different loss functions to optimize the proposed G\(^2\)GAN and thus generate images of higher quality. Extensive experiments on different scenarios demonstrate that the proposed G\(^2\)GAN achieves more photo-realistic results with lower model capacity than the other baselines. In the future, we will focus on the face aging task [42], which aims to generate facial images at different ages in a continuum.
References
1. Anoosheh, A., Agustsson, E., Timofte, R., Van Gool, L.: ComboGAN: unrestrained scalability for image domain translation. In: CVPR Workshop (2018)
2. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. In: ICML (2017)
3. Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. In: NIPS (2017)
4. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Neural photo editing with introspective adversarial networks. In: ICLR (2017)
5. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: CVPR (2018)
6. Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)
7. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS (2017)
8. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
9. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
10. Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: ICML (2017)
11. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
12. Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D.H., Hawk, S.T., Van Knippenberg, A.: Presentation and validation of the Radboud Faces Database. Cogn. Emot. 24(8), 1377–1388 (2010)
13. Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_43
14. Li, T., et al.: BeautyGAN: instance-level facial makeup transfer with deep generative adversarial network. In: ACM MM (2018)
15. Li, Y., Liu, S., Yang, J., Yang, M.H.: Generative face completion. In: CVPR (2017)
16. Liang, X., Zhang, H., Xing, E.P.: Generative semantic manipulation with contrasting GAN. In: ECCV (2018)
17. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: NIPS (2017)
18. Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: NIPS (2016)
19. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. In: NIPS (2017)
20. Mansimov, E., Parisotto, E., Ba, J.L., Salakhutdinov, R.: Generating images from captions with attention. In: ICLR (2015)
21. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: ICCV (2017)
22. Martinez, A.M.: The AR face database. CVC Technical Report (1998)
23. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
24. Nguyen, T., Le, T., Vu, H., Phung, D.: Dual discriminator generative adversarial nets. In: NIPS (2017)
25. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. In: ICML (2017)
26. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: CVPR (2017)
27. Perarnau, G., van de Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional GANs for image editing. In: NIPS Workshop (2016)
28. Qi, G.J.: Loss-sensitive generative adversarial networks on Lipschitz densities. arXiv preprint arXiv:1701.06264 (2017)
29. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text-to-image synthesis. In: ICML (2016)
30. Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. In: NIPS (2016)
31. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
32. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NIPS (2016)
33. Sangkloy, P., Lu, J., Fang, C., Yu, F., Hays, J.: Scribbler: controlling deep image synthesis with sketch and color. In: CVPR (2017)
34. Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: CVPR (2016)
35. Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: CVPR (2017)
36. Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: CVPR (2017)
37. Siarohin, A., Sangineto, E., Lathuilière, S., Sebe, N.: Deformable GANs for pose-based human image generation. In: CVPR (2018)
38. Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. In: ICLR (2017)
39. Tang, H., Wang, W., Xu, D., Yan, Y., Sebe, N.: GestureGAN for hand gesture-to-gesture translation in the wild. In: ACM MM (2018)
40. Tyleček, R., Šára, R.: Spatial pattern templates for recognition of objects with regular structure. In: Weickert, J., Hein, M., Schiele, B. (eds.) GCPR 2013. LNCS, vol. 8142, pp. 364–374. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40602-7_39
41. Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., Sebe, N.: Every smile is unique: landmark-guided diverse smile generation. In: CVPR (2018)
42. Wang, W., Yan, Y., Cui, Z., Feng, J., Yan, S., Sebe, N.: Recurrent face aging with hierarchical autoregressive memory. IEEE TPAMI (2018)
43. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
44. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: Asilomar Conference on Signals, Systems and Computers (2003)
45. Xu, R., Zhou, Z., Zhang, W., Yu, Y.: Face transfer with generative adversarial network. arXiv preprint arXiv:1710.06090 (2017)
46. Yeh, R., Chen, C., Lim, T.Y., Hasegawa-Johnson, M., Do, M.N.: Semantic image inpainting with perceptual and contextual losses. In: CVPR (2017)
47. Yi, Z., Zhang, H., Gong, P.T., et al.: DualGAN: unsupervised dual learning for image-to-image translation. In: ICCV (2017)
48. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3D facial expression database for facial behavior research. In: FGR (2006)
49. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
50. Zhu, J.Y., et al.: Toward multimodal image-to-image translation. In: NIPS (2017)