Multi-feature contrastive learning for unpaired image-to-image translation

Unpaired image-to-image translation for the generation field has made much progress recently. However, these methods suffer from mode collapse because of the overfitting of the discriminator. To this end, we propose a straightforward method to construct a contrastive loss using the feature information of the discriminator output layer, which is named multi-feature contrastive learning (MCL). Our proposed method enhances the performance of the discriminator and solves the problem of model collapse by further leveraging contrastive learning. We perform extensive experiments on several open challenge datasets. Our method achieves state-of-the-art results compared with current methods. Finally, a series of ablation studies proved that our approach has better stability. In addition, our proposed method is also practical for single image translation tasks. Code is available at https://github.com/gouayao/MCL.


Introduction
Generative adversarial networks (GANs) [14] usually include two models: a generator and a discriminator. The generator aims to capture the real data distribution to generate new samples. The discriminator aims to judge an input sample's realness to identify whether it is real or fake. Because of their solid generative capability, GANs have become one of the most promising methods in the family of generative models [13]. It is widely applied in various sectors [9,37], especially in the field of image generation.
Many problems can be summarized as image-to-image translation tasks in the image generation field, such as image denoising [5], dehazing [3,28], coloring [46], makeup [29], and super-resolution [26,34,40]. The image-to-images translation aims to find a mapping between a source domain  X and a target domain Y and "translate" the input image into the corresponding output image. In general, image-toimage translation tasks can be categorized into two groups: paired (supervised) [20,35,39] and unpaired (unsupervised) [19,24,27,30,43,47]. Pix2pix [20] investigated conditional GANs (cGANs) as a general-purpose solution to imageto-image translation problems and developed a common framework for all these problems. Wang et al. [39] and Park et al. [35] extended pix2pix and further improved the quality of the generated images. These approaches require paired data for training. However, for many tasks, paired training data are challenging to obtain. It significantly limits the application of image-to-image translation. To address this problem, Zhu et al. [47] presented a Cycle-consistency GAN (CycleGAN) for learning an inverse mapping between two domains X and Y to realize image-to-image translation tasks in the absence of paired examples. Similarly, literature [24,43] also used cycle-consistency to realize unpaired image-to-image translation.
Although cycle-consistency does not require the training data to be paired, it assumes that the relationship between the two domains X and Y is a bijection, which is often too restrictive. More recently, some methods [1,2,11,31,36] have attempted to use one-sided mapping instead of two-sided mapping. In literature [36], Park et al. first applied contrastive learning to image-to-image translation tasks by learning the correspondence between input and output patches, achieving a performance superior to those based on cycle-consistency. This method is named CUT. To further leverage contrastive learning and avoid the drawbacks of cycle-consistency, Han et al. [16] proposed a dual contrastive learning approach to infer an efficient mapping between unpaired data, referred to as Dual Contrastive Learning GAN (DCLGAN). Both CUT and DCLGAN only introduce contrastive learning into the generator, making the discriminator prone to overfitting and even suffering mode collapse during training.
Previous approaches either had strict restrictions on training datasets (paired) or mapping functions (bijective), or merely considered enhancing the performance of the generator. In this paper, we propose a multi-feature contrastive learning method. Our method is a one-sided mapping method for unpaired image-to-image translation, considering enhancing the performance of the generator and discriminator. In summary, this work aims to make two contributions: (1) Our proposed method can further enhance the performance of the discriminator, prevent the discriminator overfitting issue during training. This method benefits from multi-feature contrastive learning, which is called MCL. Large amounts of experiments show that the quantitative and qualitative aspects of our method are better than those of other methods on various unpaired translation tasks. In addition, our proposed method is also applicable to single image translation tasks, as shown in Fig. 1. (2) We analyze the feature information of the discriminator output layer and construct a contrastive loss using this feature information. Our proposed loss is simple, effective, and universally applicable, called MCL loss.
Experiments show that MCL loss can be directly added to most image-to-image translation methods (such as Cycle-GAN, CUT, and DCLGAN) to improve the quality of the generated images. In addition, since we did not utilize additional model parameters, MCL loss adds little additional training time and computational resources.

Unpaired translation
GANs [14] usually consist of two models: (a) a generator G : Z → X , (b) and a discriminator D : The generator G maps a potential variable z ∼ p (z) to X to generate a sample G (z) of realness, where p (z) represents a specific prior distribution. The discriminator D maps the input sample to a probability space to distinguish between the real and the generated sample. The training process of G and D follows the following objective function: where p data , p z , and p g represent the data distribution of real samples, input potential variables, and generated samples, respectively.
For unpaired image-to-image translation tasks [16,36,47], an unpaired dataset is given: X = {x ∈ X } and Y = {y ∈ Y}. On the one hand, the generator G wants to learn a mapping G : X → Y from the source domain X to the target domain Y. On the other hand, the discriminator D hopes to distin- Fig. 1 Single image translation tasks. We try to solve several different issues with the same architecture and objective from a single image. Here the results of our approach are shown on these issues, including style transfer, makeup, dehazing, label2facade, coloring, and summer2winter guish the transformed image G (x) from the target domain image Y. At this point, the objective function of training G and D is as follows:

Contrastive learning
Contrastive learning was found to be effective in stateof-the-art unsupervised visual representation learning tasks [6,17,21,38,42]. It aims to learn a mapping function that makes representations of associated samples closer and keeps representations of other samples away. These associated samples are named positive samples, and others are named negative. For contrastive learning, how to properly construct positive and negative samples is crucial. Some recent works investigate the use of contrastive learning for image translation [1,16,31,36]. TUNIT [1] adopts contrastive losses to simultaneously separate image domains and translates input images into the estimated domains. DivCo framework [31] uses contrastive losses to properly constrain both "positive" and "negative" relations between the generated images specified in the latent space. CUT [36] uses a noise contrastive estimation framework to maximize the mutual information between input and output for improving the performance of unpaired image-to-image translation. DCLGAN [16] extends one-sided mapping to two-sided mapping to further leverage contrastive learning, performing better in learning embeddings and thus achieving state-ofthe-art results.
Note that all the above methods only introduce contrastive learning into the generator, that leads to the discriminator overfitting issue in the training process. Our proposed MCL is a novel contrastive learning strategy, which uses the feature information of the discriminator output layer to construct the contrastive loss. We further demonstrate the superiority of our method compared to several state-of-the-art methods through extensive experiments. Our method only uses existing feature information, so almost no additional computing resources and training time will be added. The specific method is described below.

Methods
Given a dataset of X = {x ∈ X } and Y = {y ∈ Y}, we aim to learn a mapping that translates an image x from a source domain X to a target domain Y. For a 70×70 Patch-GAN discriminator [20], its output layer is a 30×30 matrix A = a i, j 30×30 -each element a i, j aims to classify whether 70 × 70 overlapping image patches are real or fake. The discriminator determines whether an input image is real or fake by the expectation of all elements.
Different from previous methods [11,16,20,36,47], we also consider how to use the feature information of the discriminator output layer to construct the contrastive loss and Overall architecture of our approach. We consider constraining the generator and discriminator based on contrastive learning. Our approach includes four loss items altogether: adversarial loss, two PatchNCE losses, and MCL loss. Adversarial loss is encouraged to control the translation style. PatchNCE loss and MCL loss are used to enhance the performance of the generator and discriminator, respectively. We omit a similar PatchNCE loss L PatchNCE (G, H , Y ) here thus enhance the generalization performance of the discriminator. Figure 2 shows the overall architecture of our approach. We combine four losses, including adversarial loss, two PatchNCE losses, and MCL loss. The details of our objective are described below.

Adversarial loss
We use an adversarial loss [14] to encourage the translated images to be visually similar enough to images from the target domain, as described below:

PatchNCE loss
We use a noise contrastive estimation framework [38] to maximize the mutual information between the input and output patches. That is, a generated output patch should appear closer to its corresponding input patch and keep away from other random patches.
Following CUT [36], a query, a positive and N negatives are, respectively, mapped to K -dimensional vectors, which are defined as v, v + ∈ R K , and v − ∈ R N ×K . Note that v − n ∈ R K represents the n-th negative. In this paper, query, positive and negative refer to output, corresponding input, and noncorresponding input, respectively. Our goal is to associate positive and stay away from negatives, which can be expressed mathematically as a cross-entropy loss [15]: We normalize vectors onto a unit sphere to prevent the space from collapsing or expanding. We use a temperature parameter τ = 0.07 as default.
Like CUT [36], the generator is divided into two components: an encoder G e and a decoder G d , applied sequentially to produce the output image We select L layers from G e (x) and send it to a small two-layer MLP network H l , producing a stack of features represents the output of the lth chosen layer. Then, we index into layers l ∈ {1, 2, . . . , L} and denote s ∈ {1, . . . , S l }, where S l is the number of spatial locations in each layer. We refer to the corresponding feature("positive") as z s l ∈ R C l and the other features("negatives") as z S\s l ∈ R (S l −1)C l , where C l is the number of channels at each layer. Similarly, we encode the output image y into ẑ l L = H l G l e (G (x)) L . We aim to match corresponding input-output patches at a specific location. In Fig. 2, for example, the head of the output zebra should be more strongly associated with the head of the input horse than the others, such as legs and grass. Thus, the PatchNCE loss can be expressed as In addition, L PatchNCE (G, H , Y ) is computed on images from the domain Y to prevent the generator from making unnecessary changes.

MCL loss
PatchNCE loss enhances the performance of the generator by learning the correspondence between input and output image patches. We further improve the performance of the discriminator using the feature information of the discriminator output layer, which is named MCL loss.
Generally, the discriminator estimates the realness of an input sample using a single scalar. However, this simple mapping undoubtedly misses some important feature information. Therefore, it is easy to overfit because the discriminator is not strong enough. To make full use of the feature information of the discriminator output layer, we use it to construct a contrastive loss instead of simply mapping it to a probability space. We treat the feature information of the discriminator output layer into a n ×n matrix A = a i, j n×n . Then, we process each row of elements of the matrix as a feature vector, that is A = α (1) , α (2) , . . . , α (n) T , where α (i) = a i,1 , a i,2 , . . . , a i,n . And we normalize each feature vector to obtain f (A) = f α (1) , f α (2) , . . . , f α (n) T . Next, we construct the MCL loss by studying the relationship between different feature vectors.
As shown in Fig. 2, for an output image y = G(x) and an image y from the target domain Y , we have f (A (y ) ) = ( f (y (1) ), f (y (2) ), . . . , f (y (n) )) T and f (A (y) ) = ( f (y (1) ), f (y (2) ), . . . , f (y (n) )) T by the discriminator (here, n=30). Naturally, we want any feature vector f (y (i) ) of y to be as close as possible to others of y and far away from the feature vectors of y . We let Formally, the contrastive loss is defined by where ω = 0.1.
According to Eq. 6, the MCL loss of the discriminator is defined as follows:

Final objective loss
Our final objective loss includes adversarial loss, two Patch-NCE losses, and MCL loss, as follows: If not specified, we choose λ X = λ Y = 1 and λ M = 0.01. Compared with the existing methods, MCL achieves stateof-the-art results. In addition, to further reduce the model training parameters and improve the training speed, we also propose a lighter and faster version, named FastMCL. In FastMCL, we no longer consider the effect of L PatchNCE (G, H , Y ) on the training process, that is to make λ Y = 0. Surprisingly, even so, FastMCL achieves slightly worse performance compared to CUT [36]. All experimental results are shown in Sect. 3.3.

Experiments
We evaluated the performance of different methods on several datasets. And we introduced the training details, datasets, and evaluation protocols of the experiments in turn. Extensive experiments were performed on unpaired image translation tasks. Furthermore, our proposed method was extended to single image translation tasks. Finally, we performed an ablation study and analyzed the influence of different loss terms on the experimental results. All the experimental results prove that our proposed method is superior to existing methods.

Training details
In this paper, we mainly follow the setup of CUT [36] for training. Our full model MCL is trained up to 400 epochs, while the fast variant FastMCL is trained up to 200 epochs. Both MCL and FastMCL include a ResNet-based generator with 9 residual blocks [22] and a PatchGAN discriminator [20]. We choose the LSGAN loss [33] as an adversarial loss and train models at 256 × 256 resolution. The learning rate is set to 0.0002 and starts to decay linearly after half of the total epochs. For a single image translation task, we adopt StyleGAN2based architecture [23] for training, named SinMCL. The generator of SinMCL consists of 1 downsampling block of StyleGAN2 discriminator, 6 StyleGAN2 residual blocks, and 1 StyleGAN2 upsampling block. The discriminator of Sin-MCL has the same architecture as StyleGAN2. Since we do not use style code, the style modulation layer of StyleGAN2 was removed. Note that the coefficient of MCL loss λ M is set to 0.03.
Cat→Dog contains 5000 training images and 500 test images for each domain from the AFHQ dataset [7].
CityScapes [8] contains 2975 training and 500 test images for each domain, a city label dataset.
Monet→Photo [36] contains only a high-resolution image in each domain, which is used for single image translation.
Van Gogh→Photo contains only a high-resolution image in each domain, which is also used for single image translation.

Evaluation protocol
Fréchet Inception Distance (FID) [18] is an evaluation metric mainly used in this paper. FID was proposed by Heusel et al. and is used to measure the distance between two data distributions. That is, a lower FID indicates better results. For cityscapes, we leverage its corresponding labels to calculate the semantic segmentation scores. We use a pre-trained FCN-8s model [20,32] and score three metrics including pixel-wise accuracy (pixAcc), average class accuracy (classAcc), and mean class Intersection over Union (IoU). In addition, we compare the model parameters and training times of different methods.
We compare our proposed method with current stateof-the-art unpaired image translation methods, including CycleGAN [47], GcGAN [11], FastCUT [36], CUT [36], SimDCL [16], and DCLGAN [16]. All the experimental results show that the quality of the images generated by our method is superior to others. Moreover, our method can produce better results with a lighter computational cost of training. Table 1 shows the evaluation results of our proposed method and all baselines on Horse→Zebra, Cat→Dog, and CityScapes datasets, and their visual effects are shown in

Best values are in bold
We compared our approach on several open datasets, primarily using FID [18] evaluation metric. For CityScapes, we leverage its corresponding labels to show the semantic segmentation scores (pixAcc, classAcc, IoU). MCL produces state-of-the-art results, and takes equal or slightly worse resources than CUT [36] in model parameters and training speed (seconds per sample). Our variant FastMCL also produced desirable results

Fig. 3
Visual results of different methods. CycleGAN [47] and GcGAN [11] are cycle-consistency methods. CUT [36], FastCUT [36], DCL-GAN [16], and SimDCL [16] introduce contrastive learning into the generator. Our versions MCL and FastMCL further leverage contrastive learning to enhance the performance of the discriminator. The last two rows show failing cases of other methods, and our method yielded relatively satisfactory results Fig. 3. It is clear that our algorithms perform superior to all the baselines. As shown in Table 1, our MCL version produces state-of-the-art results, and takes equal or slightly worse resources than CUT [36] in model parameters and training speed (seconds per sample). Our variant FastMCL also produced desirable results. The last two rows of Fig. 3  Fig. 4 Example results of our MCL compared to DCLGAN [16] and CUT [36] on the cityscapes dataset with 256 × 256 resolution. The left column represents the ground truth and label, and the right three columns represent generated images and semantic labels by different methods. MCL achieves generated images more like the ground truth, and the semantic labels obtained through the pre-trained FCN-8s model are more like the real labels Fig. 5 Example results of our MCL compared to DCLGAN [16] and CUT [36] on the cityscapes dataset with 256 × 256 resolution. The left column represents the ground truth and label, and the right three columns represent generated images and semantic labels by different methods. MCL achieves generated images more like the ground truth, and the semantic labels obtained through the pre-trained DRN model [45] are more like the real labels show failing cases of other approaches, and our approach yielded relatively satisfactory results. For cityscapes, Table 1 reports the semantic segmentation metrics on a pre-model FCN-8s model [20,32], and our method achieves the highest performance on three metrics (pixAcc, classAcc, IoU) compared to all the baselines. Figures 4 and 5 show qualitative comparison results of our method with the two most advanced unpaired methods [16,36] on semantic labels to real tasks (Cityscapes dataset). Our MCL achieves generated images more similar to the ground truth, and the semantic labels obtained through the pre-trained FCN-8s model are more similar to the real labels.
We further compare our methods with three popular paired(supervised) methods, Pix2Pix [20], photo-realistic image synthesis system CRN [4] and discriminative region proposal adversarial network DRPAN [41] on the Cityscapes dataset. The quantitative comparison results of our method with other baselines are shown in Table 2. We leverage a pre-trained FCN-8s model [20,32] to calculate three semantic segmentation metrics. Our two versions outperform supervised methods and even approach the ground truth on three metrics (pixAcc, classAcc, IoU). It shows the superiority of our method for semantic labels to real tasks.

Single image translation
Although SinCUT, another variant of CUT [36], has beaten current methods [12,25,44] in the single image translation tasks, the detailed textures of the generated images do not seem realistic enough.
Like SinCUT, our method is also suitable for single image translation, named SinMCL. Experiments are performed on the Monet→Photo and Van Gogh→Photo datasets. Figure 6 shows a qualitative comparison between SinMCL and Sin-CUT. It is not difficult to find that our generated image has superior visual performance. For example, SinCUT gen- Best values are in bold Our two versions outperform supervised methods and even approach the ground truth in three metrics, indicating the superiority of our method erated some redundant noises in the red box area on the Monet→Photo dataset, but our application can eliminate these noises well. On the Van Gogh→Photo dataset, SinMCL successfully translated it into a real pear tree. Moreover, more details can be seen after magnification.

Ablation study
Compared with all baselines, our proposed method achieves superior performance on image translation tasks. Next, we consider the influence of different loss terms on the experimental results. To save computing resources and training time, we performed an ablation study on the Horse→Zebra dataset. The final objective loss in this paper consists of four loss items, including one adversarial loss, two PatchNCE losses, and one MCL loss, as shown in Eq. 8. The coefficients of these four loss terms are 1, λ X , λ Y , and λ M , respectively. When λ M = 0, our proposed method degenerates into CUT. Fig. 7 Training curves of different methods in terms of FID on the Horse→Zebra dataset. When the MCL loss is not considered, our method degenerates into CUT [36]. If the PatchNCE loss L PatchNCE (G, H , Y ) is not considered, it degrades to FastCUT [36]. It is not difficult to see how adding the MCL loss can stabilize the training process When λ M = λ X = 0, the method degenerates into Fast-CUT. When λ M = λ X = λ Y = 0, the method degenerates into standard GAN, which can no longer adapt to image translation tasks. Figure 7 shows the training curves of different methods on the Horse→Zebra dataset, and it shows that increasing the MCL loss term can stabilize the training process.
In Table 3, we further show the quantitative results of different methods on the Horse→Zebra dataset. We calculated the minimum, maximum, mean, and standard deviation of FID during the training process. The mean and standard deviation of FID obtained by our proposed method are the smallest, which indicates that our method has better stability. Although during the first 200 epochs of the training process, FastCUT achieved the best FID with a score of 40.8. How- Fig. 6 Single painting to photo translation. We transferred the paintings of Claude Monet and Van Gogh to a nature photograph. There is only one high-resolution image per domain in the dataset used. Our approach shows superior performance in detail. More details can be seen after magnification Best values are in bold Our method achieves a slightly worse minimum of FID than FastCUT [36] over the first 200 epochs. However, MCL obtains a much smaller SD of FID compared to CUT [36] and FastCUT. This shows that our method is more stable than others ever, its training process is volatile, and the next FID would jump to a large value, as shown in Fig. 7. Compared to Fast-CUT, our method achieves a slightly worse FID with a score of 41.4. Nevertheless, our training process is more stable. As shown in Fig. 8, we further provide a visual evaluation of the best FID achieved by different methods. During the 115th training process, CUT received the best FID with a score of 44.0. However, as shown in the red box, it appears unnatural of the head and buttock of the generated zebra by CUT. The generated zebra has no eyes on its head, and the stripes of the buttock do not match those of its body. During the 140th training process, FastCUT received the best FID with a score of 40.8. It has similar problems, such as the stripes of the head do not match those on other parts of its body. During the 195th training process, MCL received the best FID with a score of 41.4. Compared to other methods, the generated zebra by MCL looks more realistic. Next, we explained the value of hyperparameter in this paper. First of all, in Eq. 6, ω aims to scale the distance between feature vectors, which is directly set to 0.1. Then, to ensure the balance between each loss term, we conducted an ablation study for the values of λ X , λ Y , and λ M , as shown in Table 4 and Fig. 9. It is not difficult to find that adding our MCL loss can effectively improve the FID value and the quality of the generated images, and the effect is best when λ X , λ Y , and λ M are 1, 1, 0.01, respectively. Therefore, unless otherwise specified, we choose λ X = λ Y = 1 and λ M = 0.01.
Many experiments show that our method is superior to previous methods in image-to-image translation tasks. This is Fig. 8 Visual evaluation of different methods on the Horse→Zebra dataset. We show the visual effects of each approach in turn on three crucial epochs. When CUT [36] or FastCUT [36] reaches the minimum of FID, the generated image does not look realistic, as shown in the red box. Instead, our approach achieves a more realistic image. Furthermore, only the FID of our approach decreases with epochs, while the FID of other methods fluctuates (λ X , λ Y , λ M ) represents the values of the hyperparameters λ X , λ Y , and λ M . It can be seen that when λ X = λ Y = 1 and λ M = 0.01, the generated image's quality is superior to others, specifically reflected in its zebra stripes are clearer and more realistic mainly due to our proposed MCL loss. We skillfully construct MCL loss using the feature information of the discriminator output layer, which already exists, so the MCL loss hardly increases the training time and computing resources. MCL loss is simple and efficient, and to verify this, we conducted experiments on the Horse→Zebra dataset. We directly added MCL loss to the existing methods. All experimental results showed that adding our MCL loss could effectively improve the quality of the generated images, as shown in Table 5 and Fig. 10.

Conclusion
We propose a straightforward method to construct a contrastive loss using the feature information of the discriminator output layer, which is named MCL. Our proposed method enhances the performance of the discriminator and solves the problem of model collapse effectively. Extensive experiments show that our method achieves state-of-the-art results in unpaired image-to-image translation by making better use of contrastive learning. Moreover, our method performs com-parably or superior to paired methods on semantic labels to real tasks. In addition, we also propose two MCL variants, namely FastMCL and SinMCL. The former is a faster and lighter version for unpaired image-to-image translation tasks, and the latter is suitable for single image translation tasks. FastMCL and SinMCL have achieved great results in their tasks, respectively.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.