In this section, we introduce the pixel-level domain transfer problem. Let us define a source image domain \(S\subset \mathbb {R}^{W\times H\times 3}\) and a target image domain \(T\subset \mathbb {R}^{W\times H\times 3}\). Given a transfer function named a converter C, our task is to transfer a source image \(I_S\in S\) to a target image \(\hat{I}_T\in T\) as
$$\begin{aligned} \hat{I}_T=C(I_S|\varTheta ^C), \end{aligned}$$
(3)
where \(\varTheta ^C\) is the model parameter of the converter. Note that the inference \(\hat{I}_T\) is not a feature vector but a target image of size \(W\times H\times 3\) itself. To this end, we employ a convolutional network model for the converter C and adopt supervised learning to optimize the model parameter \(\varTheta ^C\). In the training data, each source image \(I_S\) should be associated with a ground-truth target image \(I_T\).
4.1 Converter Network
Our target output is a pixel-level image. Furthermore, the two domains are connected by semantic meaning. Pixel-level generation is challenging in itself, and the semantic transfer makes the problem even more difficult. A converter should selectively summarize the semantic attributes of a source image and then produce a transformed pixel-level image.
Table 1. Details of each network. In (a), each entry in {\(\cdot \)} corresponds to each network. L-ReLU is leaky-ReLU. In (b), F denotes fractional-stride. The activation from the first layer is reshaped into 4 \(\times \) 4 \(\times \) 1,024 size before being fed to the second layer.
The top network in Fig. 2 shows the architecture of the converter we propose. The converter is a unified, end-to-end trainable network, but we can divide it into two parts: an encoder and a decoder. The encoder part is composed of five convolutional layers that abstract the source into a semantic 64-dimensional code. This abstraction procedure is significant since our source domain (e.g. natural fashion image) and target domain (e.g. product image) are paired by semantic content (e.g. the product). The 64-dimensional code should capture the semantic attributes (e.g. category, color, etc.) of a source so that it can be well decoded into a target. The code is then fed into the decoder, which constructs a relevant target through five decoding layers. Each decoding layer performs fractional-strided convolutions, in which the convolution operates in the opposite direction, i.e., upsampling. The reader is referred to Table 1 for more details about the architectures of the encoder and the decoder.
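To make the spatial arithmetic of the encoder and decoder concrete, the sketch below computes layer output sizes with the standard formulas for strided and fractional-strided convolutions. The 64-pixel input, 4 \(\times \) 4 kernels, stride 2 and padding 1 are illustrative DCGAN-style assumptions for exposition, not the exact settings of Table 1.

```python
def conv_out(i, k, s, p):
    """Spatial output size of a k x k convolution with stride s, padding p."""
    return (i + 2 * p - k) // s + 1

def frac_conv_out(i, k, s, p):
    """Spatial output size of a fractional-strided (transposed) convolution."""
    return (i - 1) * s - 2 * p + k

# Each stride-2 encoding layer halves the spatial size...
assert conv_out(64, k=4, s=2, p=1) == 32
# ...and each fractional-strided decoding layer doubles it back.
assert frac_conv_out(4, k=4, s=2, p=1) == 8
```

Stacking such layers lets the encoder shrink the source to a low-dimensional code and the decoder grow the code back to a full \(W\times H\times 3\) target.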
4.2 Discriminator Networks
Given the converter, a simple choice of a loss function to train it is the mean-square error (MSE), such as \(||\hat{I}_T-I_T||_2^2\). However, MSE may not be a proper choice due to critical mismatches between MSE and our problem. First, MSE is not suitable for pixel-level supervision of natural images. It is well known that MSE produces blurry images because it inherently assumes that pixels are drawn from a Gaussian distribution [14], whereas pixels in natural images are actually drawn from complex multi-modal distributions. Beyond this intrinsic limitation, MSE causes another critical problem specific to pixel-level domain transfer, as follows.
Given a source image, the target is actually not unique in our problem. Our target domain is the lowest pixel-level image space, not a high-level semantic feature space. Thus, the number of possible targets for a source is infinite. Figure 1 is a typical example showing that the target is not unique: the clothing in the target domain is captured in various shapes, and all of the targets are true. Besides shape, the target image can be captured from various viewpoints, which results in geometric transformations. However, minimizing MSE always forces the converter to fit exactly one of them; image-to-image training with MSE tolerates neither small geometric misalignments nor shape variation. Thus, training the converter with MSE is not appropriate for this problem. It would be better to introduce a new loss function that is tolerant to the diversity of the pixel-level target domain.
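A minimal numerical illustration of this failure mode: when two shifted versions of the same pattern are both valid targets, the single prediction that minimizes expected MSE is their blurry mean. The toy 1 \(\times \) 8 "images" below are hypothetical, chosen only to make the effect computable by hand.

```python
import numpy as np

# Two equally valid 1x8 "targets": the same unit spike at two shifts
t1 = np.zeros((1, 8)); t1[0, 2] = 1.0
t2 = np.zeros((1, 8)); t2[0, 5] = 1.0

mse = lambda a, b: float(((a - b) ** 2).mean())

# The single prediction minimizing expected MSE over {t1, t2} is the mean:
# a half-intensity "blur" spread over both positions.
pred = (t1 + t2) / 2
avg_for_mean  = (mse(pred, t1) + mse(pred, t2)) / 2
avg_for_sharp = (mse(t1, t1) + mse(t1, t2)) / 2  # committing to one mode
assert avg_for_mean < avg_for_sharp  # MSE prefers the blurry mean
```

This is exactly why a converter trained with MSE drifts toward averaged, blurry outputs instead of committing to one sharp, plausible target.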
In this paper, on top of the converter, we place a discriminator network that plays the role of a loss function. As in [2, 6, 17], the discriminator network guides the converter to produce a realistic target under real/fake supervision. However, this is not the only role that our discriminator plays. If we simply used the original discriminator in place of MSE, a produced target could look realistic while its contents remain irrelevant to the source, because there is no pairwise supervision such as MSE; only real/fake supervision exists.
Given arbitrary image triplets (\(I_S^+, I_S^{\oplus }, I_S^-\)) in the source domain S, where \(I_S^+\) and \(I_S^{\oplus }\) are about the same object while \(I_S^-\) is not, a converter transfers them into the images (\(\hat{I}_T^+, \hat{I}_T^{\oplus }, \hat{I}_T^-\)) in the target domain T. Let us assume that these transferred images look realistic due to the real/fake discriminator. Beyond the realistic results, the best converter C should satisfy the following condition,
$$\begin{aligned} s\left( \hat{I}_T^+, \hat{I}_T^{\oplus }\right)> s\left( \hat{I}_T^+, \hat{I}_T^-\right) \;\;\text {and}\;\; s\left( \hat{I}_T^+, \hat{I}_T^{\oplus }\right) > s\left( \hat{I}_T^{\oplus }, \hat{I}_T^-\right) , \end{aligned}$$
(4)
where s(\(\cdot \)) is a semantic similarity function. This condition means that an estimated target should be semantically associated with the source. One supervision candidate to let the converter C meet the condition is the combined use of MSE with the real/fake loss. However, again, it is not the best option for our problem because the ground-truth \(I_T\) is not unique. Thus, we propose a novel discriminator, named domain discriminator, to take the pairwise supervision into consideration.
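As a toy illustration of condition (4), one may instantiate s(\(\cdot \)) as cosine similarity over semantic codes. The 64-dimensional vectors below are synthetic stand-ins for the codes of transferred images, not outputs of the actual networks: the \(+\) and \(\oplus \) codes share a common content vector, while the \(-\) code is independent.

```python
import numpy as np

def cos_sim(a, b):
    """A simple choice for the semantic similarity s(.): cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
base    = rng.normal(size=64)                  # shared semantic content
t_plus  = base + 0.1 * rng.normal(size=64)     # code of \hat{I}_T^+
t_oplus = base + 0.1 * rng.normal(size=64)     # code of \hat{I}_T^{\oplus}
t_minus = rng.normal(size=64)                  # code of \hat{I}_T^- (unrelated)

# Condition (4): same-object transfers are more similar than cross-object ones
assert cos_sim(t_plus, t_oplus) > cos_sim(t_plus, t_minus)
assert cos_sim(t_plus, t_oplus) > cos_sim(t_oplus, t_minus)
```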
The domain discriminator \(D_A\) is the bottom network illustrated in Fig. 2. To enable pairwise supervision while remaining tolerant to the target diversity, we significantly loosen the level of supervision compared to MSE. The network \(D_A\) takes a pair of source and target as input, and produces a scalar probability of whether the input pair is associated or not. Let us assume that we have a source \(I_S\), its ground-truth target \(I_T\) and an irrelevant target \(I_T^-\). We also have an inference \(\hat{I}_T\) from the converter C. We then define the loss \(\mathcal {L}_A^D\) of the domain discriminator \(D_A\) as,
$$\begin{aligned}&\mathcal {L}_A^D(I_S,I)=-t\cdot \log [D_A(I_S, I)] + (t-1)\cdot \log [1-D_A(I_S, I)],\nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \text {s.t.}\;\;t=\left\{ \begin{matrix} 1&{}\;\text {if}\;\;I=I_T\\ 0&{}\;\text {if}\;\;I=\hat{I}_T\\ 0&{}\;\;\text {if}\;\;I=I_T^-. \end{matrix}\right. \end{aligned}$$
(5)
The source \(I_S\) is always fed to the network as one of the input pair, while the other input I is chosen from (\(I_T^-\), \(\hat{I}_T\), \(I_T\)) with equal probability. Only when the source \(I_S\) and its ground-truth \(I_T\) are paired as input is the domain discriminator trained to produce a high probability; in the other cases it minimizes the probability. Here, let us pay closer attention to the input case of (\(I_S\), \(\hat{I}_T\)).
The produced target \(\hat{I}_T\) comes from the source, but we regard the pair as unassociated (\(t=0\)) when training the domain discriminator. We do so to enable adversarial training of the converter and the domain discriminator: the domain discriminator loss is minimized when training the domain discriminator, and maximized when training the converter. The better the domain discriminator distinguishes a ground-truth \(I_T\) from an inference \(\hat{I}_T\), the better the converter transfers the source into a relevant target.
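The loss in Eq. (5) is a binary cross-entropy over the association probability \(D_A(I_S, I)\). A small numerical sketch, with hypothetical discriminator outputs, shows which outputs each label t rewards:

```python
import numpy as np

def domain_disc_loss(d_out, t):
    # Eq. (5): cross-entropy on the association probability D_A(I_S, I)
    return -t * np.log(d_out) + (t - 1) * np.log(1 - d_out)

# t = 1 only for the ground-truth pair (I_S, I_T): a high D_A output is rewarded
assert domain_disc_loss(0.9, t=1) < domain_disc_loss(0.1, t=1)
# t = 0 for (I_S, \hat{I}_T) and (I_S, I_T^-): a low D_A output is rewarded
assert domain_disc_loss(0.1, t=0) < domain_disc_loss(0.9, t=0)
```

During converter updates the sign flips: the converter is rewarded when \(D_A(I_S, \hat{I}_T)\) is high, i.e., when the generated target is judged associated with its source.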
In summary, we employ both the real/fake discriminator and the domain discriminator for adversarial training. These two networks act as losses to optimize the converter, but have different objectives: the real/fake discriminator penalizes an unrealistic target, while the domain discriminator penalizes a target that is irrelevant to its source. The architecture of the real/fake discriminator is identical to that of [17], as illustrated in Fig. 2. The domain discriminator has the same architecture except for the input filter size, since our input pair is stacked along the channel axis. Several architecture families have been proposed for feeding a pair of images to be compared, but a simple stack along the channel axis has shown the best performance, as studied in [27]. The reader is referred to Table 1 for more details about the discriminator architectures.
4.3 Adversarial Training
In this section, we present the method for training the converter C, the real/fake discriminator \(D_R\) and the domain discriminator \(D_A\). Since we have two discriminators, two loss functions are defined: the real/fake discriminator loss \(\mathcal {L}_R^D\) is Eq. (2), and the domain discriminator loss \(\mathcal {L}_A^D\) is Eq. (5). With these two loss functions, we follow the adversarial training procedure of [6].
Given a paired image set for training, assume we draw a source batch \(\{I_S^i\}\) and a target batch \(\{I^i\}\), where each target sample \(I^i\) is stochastically chosen from \((I_T^i, I_T^{i-}, \hat{I}_T^i)\) with equal probability. First, we train the discriminators: the real/fake discriminator \(D_R\) is trained with the target batch to reduce the loss of Eq. (2), and the domain discriminator \(D_A\) is trained with both the source and target batches to reduce the loss of Eq. (5). After that, we freeze the updated discriminator parameters \(\{\hat{\varTheta }_R^D,\hat{\varTheta }_A^D\}\) and optimize the converter parameters \(\varTheta ^C\) to increase the losses of both discriminators. The loss function of the converter can be represented as,
$$\begin{aligned} \mathcal {L}^C(I_S,I)=-\frac{1}{2}\mathcal {L}_R^D \left( I\right) -\frac{1}{2}\mathcal {L}_A^D(I_S,I),\;\; \text {s.t.}\;\;I=\text {sel}\left( \{I_T,\hat{I}_T,I_T^-\}\right) , \end{aligned}$$
(6)
where sel(\(\cdot \)) is a random selection function with equal probability. The reader is referred to Algorithm 1 for more details of the training procedures.
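The converter objective of Eq. (6) can be sketched numerically as follows. The discriminator outputs are hypothetical scalars; both terms use the cross-entropy form shared by Eq. (2) and Eq. (5), with \(t=0\) for the generated-sample case considered here.

```python
import numpy as np

def disc_loss(d_out, t):
    # Binary cross-entropy, the form shared by Eq. (2) and Eq. (5)
    return -t * np.log(d_out) + (t - 1) * np.log(1 - d_out)

def converter_loss(d_real_fake, d_assoc):
    # Eq. (6): the converter is trained to *increase* both discriminator
    # losses; for a generated target \hat{I}_T both labels are t = 0.
    return -0.5 * disc_loss(d_real_fake, t=0) - 0.5 * disc_loss(d_assoc, t=0)

# When both discriminators are fooled (outputs near 1 on a generated target),
# the converter loss is lower than when they confidently reject it.
assert converter_loss(0.9, 0.9) < converter_loss(0.1, 0.1)
```

Minimizing this converter loss while the frozen discriminators minimize their own losses yields the alternating adversarial updates of the training procedure.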