1 Introduction

With the development of deep learning, image-to-image translation has received great attention in recent years. Many computer vision tasks, such as image restoration and image enhancement, can be handled under the image-to-image translation framework. The Generative Adversarial Network (GAN) framework is one of the most popular frameworks in this field, and it has led to significant improvements in several specific image processing tasks, such as image super-resolution [1,2,3], semantic segmentation [4], and image inpainting [5,6,7].

In a Generative Adversarial Network (GAN), the generator G is trained to increase the probability that fake data is judged real, while the discriminator D tries to distinguish whether an input image x is natural and realistic. In contrast to the standard discriminator D, Alexia et al. [8, 9] argue that the discriminator should simultaneously increase the probability that fake data is real and decrease the probability that real data is real. They introduced this property by changing the discriminator into a relativistic form, which estimates the probability that a real image is relatively more natural and realistic than a fake one.

However, this discriminator is relativistic at only a single scale, which can still lead to artifacts in the generated image. To solve this problem, we propose an Enhanced CycleGAN (ECGAN) framework with a multi-scale relativistic average discriminator. In summary, we improve the key components of the original CycleGAN model in three aspects:

  • We improve the discriminator by introducing Multi-Scale Relativistic average Discriminator (MS-RaD), which tries to distinguish on different scales whether one image is more realistic than the other, rather than whether one image is real or fake.

  • We add a complementary loss term, computed between the synthesized images and real images, alongside the cycle-consistency loss computed between the cycled images and real images. The resulting complementary cycle-consistent loss increases the quality of the results and reduces artifacts.

  • We introduce the Residual-in-Residual Dense Block (RRDB) as the basic module of the generator G. This gives the network higher capacity and makes it easier to train. As in [1], we do not apply Batch Normalization (BN) [10] or Instance Normalization (IN) [11] layers in the generator.

Experiments show that these improvements help the generator create more realistic textures and details. Extensive experiments show that ECGAN outperforms state-of-the-art methods on both similarity metrics and perceptual scores.

2 Related Works

For convenience of description, we define the task of image-to-image translation as follows: given a series of images \(\{I_{X}^i\}_{i=1}^N\) from a source domain X and \(\{I_{Y}^j\}_{j=1}^M\) from a target domain Y, the goal is to learn the mapping from the source domain X to the target domain Y, denoted by \({\mathcal {T}:I_X \rightarrow I_Y}\). Many methods [12,13,14,15] have been proposed to solve this problem. Among them, the most famous framework is the general-purpose pix2pix framework (proposed by Isola et al. [16]) based on Conditional GANs (cGANs).

Zhu et al. [17] presented CycleGAN, which uses a cycle-consistency loss to remove the paired-training-data requirement of pix2pix. When converting an image from a source domain X to a target domain Y via \({\mathcal {T}:I_X \rightarrow I_Y}\), the cycle consistency loss enforces \({\mathcal {F}(\mathcal {T}(I_X))\approx I_X}\) by introducing an inverse mapping \({\mathcal {F}:I_Y\rightarrow I_X}\) (and vice versa). In addition to the original adversarial losses and the cycle consistency loss, a perceptual loss [18] or \(L_1\) loss is also used in many later works to improve the quality of synthesized images [19].

Isola et al. [16] apply a U-Net [20] generator and a patch-based discriminator. Besides U-Net [20], ResNet [21] is another popular generator architecture; it uses the residual block as its basic module. Wang et al. [14] made it possible for pix2pix to synthesize \(2048\times 1024\) high-resolution photo-realistic images from semantic label maps [22, 23]. They improved the original pix2pix framework by using a coarse-to-fine generator and a multi-scale discriminator. The coarse-to-fine generator can be decomposed into a global generator network and a local enhancer network. The multi-scale discriminator [14] is composed of three discriminators that share an identical structure but operate at three different scales. These discriminators are trained with real and synthesized images at their respective scales. The discriminator at the coarsest scale guides the generator toward globally consistent images, while the finest one encourages finer details.

Fig. 1. Complementary cycle-consistent loss. Left: loss with paired samples. Middle: original cycle-consistent loss. Right: complementary cycle-consistent loss. The complementary loss term is shown as the red dashed line and is calculated as the \(L_1\) loss between the synthesized images and real images. (Color figure online)

3 ECGAN with Multi-Scale Relativistic Average Discriminator

Similar to the aforementioned definition, given a series of images \(\{I_{X}^i\}_{i=1}^N\) from a source domain X and \(\{I_{Y}^j\}_{j=1}^M\) from a target domain Y, our goal is to learn a mapping \({\mathcal {T}:I_X \rightarrow I_Y}\) from the source domain X to the target domain Y such that the distribution of images from \(\mathcal {T}(I_X)\) is indistinguishable from the distribution of \(I_Y\).

There are three main modifications to the network structure: (1) use of the Multi-Scale Relativistic average Discriminator (MS-RaD); (2) addition of a complementary loss term to the original cycle-consistent loss; (3) replacement of the original basic residual blocks with the Residual-in-Residual Dense Block (RRDB) and removal of all BN and IN layers in the generator G. We describe each modification in detail below.

3.1 Multi-Scale Relativistic Average Discriminator

In a standard GAN, the discriminator is usually defined as \(D(x)=\sigma (C(x))\), where \(\sigma \) denotes the activation function (typically sigmoid) and C(x) represents the non-transformed output of the discriminator. The simplest way to make it relativistic [8], i.e., to make the output of D depend on both real and fake data, is to define it as \(D(\hat{x})=\sigma (C(x_r)-C(x_f))\) for pairs of real and fake samples \(\hat{x}=(x_r,x_f)\), where the subscripts r and f denote a real image and a fake image, respectively.

Rather than judging the probability that the input data is real, a relativistic discriminator measures the probability that the input data is relatively more realistic than randomly sampled data of the opposite type. To make this judgement more global, we can compare against the average output over the counterpart's samples: \(D(x)=\sigma (C(x_r)-E(C(x_f)))\), as shown in Fig. 2.
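As a concrete illustration, the relativistic average outputs can be written in a few lines of PyTorch. This is a minimal sketch, not our exact implementation; `C` stands for any network returning the non-transformed output C(x), and `x_real`, `x_fake` are illustrative names.

```python
import torch

def rad_outputs(C, x_real, x_fake):
    """Relativistic average discriminator outputs (minimal sketch).

    C       -- any network returning the non-transformed output C(x)
    x_real  -- batch of real images
    x_fake  -- batch of generated images
    """
    c_real, c_fake = C(x_real), C(x_fake)
    # D(x_r) = sigma(C(x_r) - E[C(x_f)]): prob. real is more realistic than fake
    d_real = torch.sigmoid(c_real - c_fake.mean())
    # D(x_f) = sigma(C(x_f) - E[C(x_r)]): prob. fake is more realistic than real
    d_fake = torch.sigmoid(c_fake - c_real.mean())
    return d_real, d_fake
```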

Fig. 2. The difference between the standard discriminator and the relativistic average discriminator [3]. Left: the standard discriminator judges the probability that the input data is real or fake. Right: the relativistic average discriminator judges the probability that a real (or fake) image is relatively more realistic than a fake (or real) one.

Here we extend the relativistic discriminator design to multiple scales,

$$\begin{aligned} \min \limits _{G} \max \limits _{\{D_k\}_{k=1}^N}{\sum \limits _{k}L_D(G,D_k)}, \end{aligned}$$
(1)

where \(L_D\) can be formulated as:

$$\begin{aligned} \begin{aligned} L_D(G,D_k)=&E_{x_r^k\sim \mathbb {P}}[\sigma (C(x_r^k)-E_{x_f^k \sim \mathbb {Q}}[C(x_f^k)])]\\&+ E_{x_f^k\sim \mathbb {Q}}[\sigma (C(x_f^k)-E_{x_r^k\sim \mathbb {P}}[C(x_r^k)])], \end{aligned} \end{aligned}$$
(2)

where \(\mathbb {P}\) represents the distribution of real data, \(\mathbb {Q}\) represents the distribution of fake data, the superscript k indexes the scale, and C(x) is the non-transformed discriminator output at x.
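The sketch below illustrates one way to evaluate this multi-scale loss; the per-scale inputs are obtained by average-pooling, following the multi-scale discriminator of Wang et al. [14]. It computes Eq. (2) as written (practical implementations often wrap the relativistic terms in the log/BCE form of [8]); `discriminators`, `x_real`, and `x_fake` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_rad_loss(discriminators, x_real, x_fake):
    """Sum Eq. (2) over scales k = 1..N (minimal sketch).

    discriminators -- list of per-scale critics [C_1, ..., C_N], each
                      returning the non-transformed output C(x)
    """
    loss = 0.0
    for k, C in enumerate(discriminators):
        # k-th scale sees the images downsampled by a factor of 2**k
        r = F.avg_pool2d(x_real, 2 ** k)  # k = 0 keeps full resolution
        f = F.avg_pool2d(x_fake, 2 ** k)
        c_r, c_f = C(r), C(f)
        # relativistic average terms at scale k
        loss = loss + torch.sigmoid(c_r - c_f.mean()).mean() \
                    + torch.sigmoid(c_f - c_r.mean()).mean()
    return loss
```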

3.2 Complementary Cycle-Consistent Loss

Our goal is to learn a mapping \({\mathcal {T}:I_X \rightarrow I_Y}\) such that the distribution of images from \(\mathcal {T}(I_X)\) is indistinguishable from the distribution of \(I_Y\) under an adversarial loss. Because this mapping is highly under-constrained, CycleGAN couples it with an inverse mapping \({\mathcal {F}:I_Y\rightarrow I_X}\) and introduces a cycle consistency loss to enforce \({\mathcal {F}(\mathcal {T}(I_X))\approx I_X}\) (and vice versa).

However, even with the cycle-consistency loss, the problem remains highly under-constrained. As depicted in Fig. 1, the cycle-consistent loss is calculated as the \(L_1\) loss between the real images \(I_X\) and the cycled images \(I_X'' \triangleq \mathcal {F}(\mathcal {T}(I_X))\) in domain X, and between the real images \(I_Y\) and the cycled images \(I_Y'' \triangleq \mathcal {T}(\mathcal {F}(I_Y))\) in domain Y. We observe that the relationship between the real images \(I_X\), \(I_Y\) and the synthesized images \(I_X'\triangleq \mathcal {F}(I_Y)\), \(I_Y'\triangleq \mathcal {T}(I_X)\) is missing: the translations \(I_X'\) and \(I_Y'\) are not directly constrained. We therefore add this term to the original loss and call the result the complementary cycle-consistent loss.

The final complementary cycle-consistent loss is defined as follows:

$$\begin{aligned} \begin{aligned} L_{CCCL_X}&=\underbrace{||I_X-I_X''||_1}_{\text {cycle-consistent loss}}+\lambda _{1}||I_X-I_X'||_1, \\ L_{CCCL_Y}&=\underbrace{||I_Y-I_Y''||_1}_{\text {cycle-consistent loss}}+\lambda _{2}||I_Y-I_Y'||_1 \end{aligned} \end{aligned}$$
(3)
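For illustration, Eq. (3) can be evaluated as in the following sketch, where `T` and `F_inv` stand for the two generators; the default weights for \(\lambda _1\) and \(\lambda _2\) are hypothetical placeholders, not values fixed by the text.

```python
import torch.nn.functional as F

def complementary_cycle_loss(T, F_inv, I_X, I_Y, lambda1=1.0, lambda2=1.0):
    """Complementary cycle-consistent loss of Eq. (3) (minimal sketch).

    T      -- generator for the mapping X -> Y
    F_inv  -- generator for the inverse mapping Y -> X
    lambda1, lambda2 -- weights of the complementary terms (hypothetical)
    """
    I_Yp  = T(I_X)        # synthesized image  I_Y'  = T(I_X)
    I_Xp  = F_inv(I_Y)    # synthesized image  I_X'  = F(I_Y)
    I_Xpp = F_inv(I_Yp)   # cycled image       I_X'' = F(T(I_X))
    I_Ypp = T(I_Xp)       # cycled image       I_Y'' = T(F(I_Y))

    loss_x = F.l1_loss(I_Xpp, I_X) + lambda1 * F.l1_loss(I_Xp, I_X)
    loss_y = F.l1_loss(I_Ypp, I_Y) + lambda2 * F.l1_loss(I_Yp, I_Y)
    return loss_x + loss_y
```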

3.3 Residual-in-Residual Dense Block

Some previous works observe that more layers and connections can boost performance [1, 24, 25]. Zhang et al. [24] employ a multi-level residual network. Wang et al. [3] propose a similar residual-in-residual structure, whose deeper and more complex design gives the network higher capacity. We replace the original residual block with this Residual-in-Residual Dense Block (RRDB). The basic structure of the RRDB is depicted in Fig. 3.
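A minimal PyTorch sketch of an RRDB, following the ESRGAN design [3], is shown below. The feature width `nf`, growth channels `gc`, and scaling `beta` are illustrative defaults, not the exact configuration of our generator; note the absence of any BN or IN layers.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """5-layer dense block without BN/IN layers (sketch after [3])."""
    def __init__(self, nf=64, gc=32, beta=0.2):
        super().__init__()
        self.beta = beta  # residual scaling parameter
        # conv i sees the input plus all previous growth features
        self.convs = nn.ModuleList(
            nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, padding=1)
            for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:                      # last conv has no activation
                feats.append(self.lrelu(out))
        return x + self.beta * out         # scaled inner residual

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: 3 dense blocks in an outer residual."""
    def __init__(self, nf=64, gc=32, beta=0.2):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(nf, gc, beta) for _ in range(3)])
        self.beta = beta

    def forward(self, x):
        return x + self.beta * self.blocks(x)
```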

Fig. 3. The difference between the RB and the RRDB [3]. Left: the Residual Block (RB), with or without Batch Normalization (BN) layers. Right: the RRDB block (\(\beta \) is the residual scaling parameter).

We empirically observe that Batch Normalization layers [10] tend to introduce artifacts in image translation as well, just as Wang et al. [3] found in super-resolution, where they are called BN artifacts. Removing all Batch Normalization layers achieves stable and consistent performance without such artifacts, and also reduces memory usage and computational cost dramatically.

The generator and discriminator architectures in our work are adapted from CycleGAN [17]: we replace each Residual Block with a Residual-in-Residual Dense Block and use no Batch Normalization layers.

4 Experiment

4.1 Datasets

To appraise the efficiency of the proposed method, we conducted experiments on two benchmark datasets, namely CUHK and Facades, which we briefly describe first.

  1. CUHK Face Sketch Database [26]: The CUHK dataset consists of 188 face image pairs, each a student face photo and its corresponding sketch. We use its \(256\times 256\times 3\) resized and cropped version in our experiments. 100 pairs are used for training and the rest for testing.

  2. CMP Facade Database [27]: The Facade database presents facade images from different cities around the world in diverse styles. It includes 606 rectified image pairs of labels and corresponding facades with dimensions of \(256\times 256\times 3\). 400 pairs are used for training, while the remaining pairs are used for testing.

Fig. 4. Qualitative comparison on the CUHK dataset. From left to right: input, ground truth, CycleGAN, CSGAN, and ours. Our method generates realistic and natural images with fewer artifacts.

Fig. 5. Qualitative comparison on the Facades dataset. From left to right: input, ground truth, CycleGAN, CSGAN, and ours. Our method generates realistic and natural images with fewer artifacts.

4.2 Evaluation Metrics

Both quantitative and qualitative results are computed to evaluate the performance of the proposed method. The Structural Similarity Index (SSIM) [28], Peak Signal-to-Noise Ratio (PSNR), and the Fréchet Inception Distance (FID) [29] are adopted to assess the results.

PSNR and SSIM are Full Reference Image Quality Assessment (FR-IQA) [30, 31] metrics, usually applied to judge the similarity between the generated image and the ground-truth image in many tasks such as image enhancement [32, 33], image de-raining [34,35,36], and super-resolution [1,2,3, 37].
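For reference, both metrics can be computed with scikit-image as in the following sketch, assuming \(256\times 256\times 3\) uint8 images as used in our experiments.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def similarity_scores(generated, ground_truth):
    """PSNR and SSIM between a generated image and its ground truth (sketch).

    Both inputs are H x W x 3 uint8 arrays.
    Requires scikit-image >= 0.19 for the channel_axis argument.
    """
    psnr = peak_signal_noise_ratio(ground_truth, generated, data_range=255)
    ssim = structural_similarity(ground_truth, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```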

FID [29] calculates the Fréchet (Wasserstein-2) distance between the synthesized and real images in the feature space of an Inception-v3 network [38]. A lower FID score means the synthetic and real image distributions are closer. FID has been shown to be more principled, comprehensive, and consistent with human evaluation of the diversity and realism of synthesized images. The qualitative comparisons are shown in Figs. 4 and 5, and the SSIM, PSNR, and FID scores are reported in Table 1.
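Under the usual Gaussian assumption on the features, the distance has a closed form. The sketch below computes it from pre-extracted Inception-v3 activations; the feature extraction itself (with a pretrained network) is assumed to happen elsewhere.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(feat_real, feat_fake):
    """Frechet Inception Distance from Inception-v3 features (sketch).

    feat_real, feat_fake -- (n_samples, 2048) activation arrays from the
    pool3 layer of Inception-v3, extracted beforehand.
    """
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_f = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```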

Table 1. The average SSIM, MSE, PSNR, and FID scores on the CUHK and Facades datasets. Bold values highlight the best results.

4.3 Training Information

We train with the ADAM optimizer [39], setting \({\beta _1}= 0.9\), \({\beta _2} = 0.999\) and a learning rate of 0.0002. The generator and discriminator networks are trained jointly. We train for 200 epochs with batch size 1, which took roughly 5 hours on an i7 machine with 8 GB of memory and a GeForce GTX 1080 Ti GPU.
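The optimizer setup can be sketched as follows; `generator` and `discriminators` below are stand-in modules for the actual ECGAN networks, used only to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

# Stand-ins for the actual ECGAN generator and multi-scale discriminators.
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminators = nn.ModuleList(nn.Conv2d(3, 1, 4, stride=2) for _ in range(3))

# ADAM with beta1 = 0.9, beta2 = 0.999 and learning rate 0.0002, as above.
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.9, 0.999))
d_opt = torch.optim.Adam(discriminators.parameters(), lr=2e-4, betas=(0.9, 0.999))
```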

4.4 Experimental Results and Analysis

4.5 Quantitative Evaluation

Table 1 lists the comparative results of CycleGAN [17], CSGAN [40], and the proposed method on the CUHK sketch-to-face and Facades labels-to-buildings datasets, respectively. In terms of the average SSIM and PSNR scores, the proposed method clearly improves over the others. The highest SSIM and PSNR scores show that the proposed method generates faces more structurally similar to the ground truth for a given sketch. The lowest FID score means it also achieves the most perceptually convincing results.

4.6 Qualitative Evaluation

Figures 4 and 5 show the qualitative comparison of the proposed method on the CUHK and Facades datasets, respectively. The results generated by CycleGAN contain different types of artifacts such as face distortion, color inconsistencies, and BN artifacts [3]. The results of CSGAN are better, but still suffer from BN artifacts on some images. These unwanted side effects are significantly reduced by our method, whose results are more natural, realistic, and diverse, with fewer artifacts.

4.7 Ablation Experiments

We also conduct an ablation study on each component. The results are shown in Table 2, where Multi-RaD, CCCL, and RRDB denote the Multi-Scale Relativistic average Discriminator, the Complementary Cycle-Consistent Loss, and the Residual-in-Residual Dense Block, respectively. Table 2 shows that each of the three components helps achieve better results than the model without it.

Table 2. Ablation experiments on the CUHK dataset.

5 Conclusion

We present an ECGAN model that achieves results that are both structurally similar to the ground truth and of better perceptual quality. We first extend the Relativistic average Discriminator to a multi-scale form, which learns to judge on different scales whether one image is more realistic than another, leading the generator G to create more natural textures and details. A Complementary Cycle-Consistent Loss is added to the original CycleGAN objective to guide the translation in the desired direction and minimize unwanted artifacts. We also introduce a generator structure containing several RRDB blocks without batch normalization layers to the field of image-to-image translation. The experiments show that the proposed method is better than or comparable with recent state-of-the-art methods on two benchmark image translation datasets.