Unsupervised image translation with distributional semantics awareness

Unsupervised image translation (UIT) studies the mapping between two image domains. Since such mappings are under-constrained, existing research has pursued various desirable properties such as distributional matching or two-way consistency. In this paper, we re-examine UIT from a new perspective: distributional semantics consistency, based on the observation that data variations contain semantics, e.g., shoes varying in colors. Further, the semantics can be multi-dimensional, e.g., shoes also varying in style, functionality, etc. Given two image domains, matching these semantic dimensions during UIT will produce mappings with explicable correspondences, which has not been investigated previously. We propose distributional semantics mapping (DSM), the first UIT method which explicitly matches semantics between two domains. We show that distributional semantics has been rarely considered within and beyond UIT, even though it is a common problem in deep learning. We evaluate DSM on several benchmark datasets, demonstrating its general ability to capture distributional semantics. Extensive comparisons show that DSM not only produces explicable mappings, but also improves image quality in general.


Introduction
Unsupervised image translation (UIT) has been intensively studied in recent years. Many applications have been inspired by its ability to create mappings between two image domains. Since there can theoretically be an infinite number of mappings between two domains, UIT is by nature an under-constrained problem. Naturally, different approaches have been developed to ensure certain desirable properties, such as shared latent spaces [1], two-way consistency [2], pair-wise distance preservation [3], and image semantics [4]. While existing research tends to focus on general distributional matching [3,5], we aim to investigate a rarely examined perspective: the distributional semantics during UIT.
We define distributional semantics as the visually understandable variations between samples (not within a single sample). Shoes vary in color, style (e.g., low/high collars), and functionality (e.g., sneakers/high heels). Similarly, bags also vary in color, style (e.g., with/without handles), and functionality (e.g., purses/backpacks). During UIT, we argue that it is not enough to simply translate images. It is desirable for such distributional semantics to be maintained, e.g., red high-collar high heels map to black purses with handles, while white low-collar sneakers map to blue backpacks. This is exactly the goal of this research.
To maintain distributional semantics during UIT, two critical problems must be addressed. The first is which semantics should be maintained. As we are considering unsupervised learning, no labeling should be required, which means that the data variations (the distributional semantics) should be characterized without prior knowledge, yet be interpretable by humans. The second is that distributional semantics are rarely considered in deep learning in general, because data are usually transformed numerous times in the model. To maintain such semantics, we need a mechanism to ensure that the data distribution remains as undistorted as possible, especially in the dimensions of the semantics of interest, during the transformations that map between the two domains.
In this paper, we propose a novel deep learning method called distributional semantics mapping (DSM). Given two image datasets $A = \{x_i\}$ and $B = \{y_j\}$, and a desire to characterize visual semantics in an unsupervised manner, we find that the covariance structure of the data naturally reflects important visual semantics. We choose principal component analysis (PCA) to characterize the covariance structure: Härkönen et al. [6] have already demonstrated that PCA applied in feature space can produce interpretable controls for image synthesis. Our approach is agnostic to specific network architectures and consists of three key modules. The first is a semantics-preserving transformation $e$ where the variations of the samples $x$ in a direction (e.g., the first principal component of $A$) must be consistent with the variations of the latent vectors $z_x = e(x)$ in the corresponding direction (e.g., the first principal component of $e(A)$) in the latent space. We use two such encoders, $e_A$ and $e_B$, to project $A$ and $B$ into a shared latent space. The second module aligns the key dimensions of $e_A(A)$ and $e_B(B)$. The last module is a decoder/generative network $g$ which decodes $y$ by $y_{\mathrm{recon}} = g(e_B(y))$ and translates $x$ by $y_{\mathrm{trans}} = g(e_A(x))$.
As far as we know, this is the first approach for UIT that preserves distributional semantics. We identify the importance of preserving distributional semantics, which has a wide range of implications for image translation and beyond. We propose a new approach that helps preserve distributional semantics during image transformations.

Generative adversarial networks
Generative adversarial networks (GANs) [7] have achieved great success for a fast-growing number of computer vision tasks, including image generation [8,9], image colorization [10], image inpainting [11], and image super-resolution [12]. Conditional GANs [13] can be used to perform image-to-image translation [2,14-16]. Recently, some interactive systems have been proposed using GANs for real-time portrait image editing [17,18]. Our work also utilizes GANs conditioned on an input image, but it does not rely on any specific GAN model. To validate its generality, we employ two widely used GAN models: LSGAN [19] and NSGAN [7].

Image-to-image translation
To perform image-to-image translation, early methods such as pix2pix [10,20] often required the networks to be trained with paired training data. Recently, a variety of approaches [14,15] have been proposed to learn the image translation from unpaired data. For example, CycleGAN [2] leverages cycle consistency to constrain the mapping. Lu et al. [21,22] show that optimal transport costs can improve the generative network. UNIT [1] assumes that two image domains can share the same latent space. By decomposing the image into the style (domain-specific) code and content (domain-invariant) code, MUNIT [23] and DRIT [24] can synthesize diverse outputs from an input image. Alami Mejjati et al. [25] and Kim et al. [26] improve translation results using an attention mechanism. Choi et al. [27] propose StarGAN, which can perform image translation for multiple domains using a single GAN. More recently, DRIT++ [28] extends DRIT to support multiple domains, while StarGAN v2 [29] extends StarGAN to generate diverse images across multiple domains. FUNIT [30] can work on previously unseen target classes given only a few example images. To enable unsupervised one-sided mapping, Benaim and Wolf [3] present DistanceGAN, which maintains the distances between images, and Fu et al. [31] employ other geometric constraints (e.g., orientation). Our method differs from existing methods in that it explicitly preserves and matches the distributional semantics in domains during UIT, which generates explicable mappings between images.

Approach
Given two image datasets $A = \{x_i\}$ and $B = \{y_j\}$, we aim to compute a mapping $M : A \to B$, so that the distributional semantics of $A$ are aligned with those of $B$; we use principal components (PCs) to describe the semantics. First, we project $A$ using an encoder $e_A$ to get $z_A = \{z_{x_i}\}$, where each $z_{x_i}$ is a latent vector. We keep the distributional semantics of $A$ and $z_A$ aligned during the encoding process. A similar projection is applied to $B$ using an encoder $e_B$ to get $z_B = \{z_{y_j}\}$. We denote the PCs of the two latent distributions as $V_A \in \mathbb{R}^{P \times P}$ and $V_B \in \mathbb{R}^{P \times P}$, where $P$ is the dimensionality of the latent vectors. Next, we ensure that the top $K$ PCs of $z_A$ are aligned with those of $z_B$. Finally, we use a generative network $g$ to reconstruct $B$ using $y_{\mathrm{recon}} = g(e_B(y))$ and translate $A$ by $y_{\mathrm{trans}} = g(e_A(x))$. The model is shown in Fig. 1. Below, we give the details of the components.

Semantics-preserving transformation
The images need to go through a sequence of transformations during translation, which in deep learning are usually encoding processes such as convolutions. However, the shape of the latent distribution seems to be rarely considered in terms of its semantic consistency with the data distribution itself. Existing efforts such as imposing a prior distribution [32] or geometric constraints [3] are mostly aimed at encouraging the latent distribution to behave well, rather than to match the data distribution. As a result, current encoders may not be able to preserve the distributional semantics, as we confirm in experiments. We therefore introduce a new general autoencoding scheme, Eq. (1), to preserve the distributional semantics. In Eq. (1), $x$ is a data sample, $x'$ is its reconstruction, and $e$ and $d$ are the encoding and decoding schemes respectively. $U$ and $V$ are the first $K$ PCs of $A$ and of the latent vectors $\{z_x\}$ respectively. The autoencoder can be trained by, e.g., minimizing $\|x - x'\|_2^2$. Equation (1) states the key difference between our autoencoder and existing autoencoders: it requires the projections of $x$ onto the data-space PCs $U$ to be equal to the projections of $z_x$ onto the latent-space PCs $V$. Note that standard dimensionality reduction using PCA is a special case of this scheme; Eq. (1) is more general because it does not dictate what $V$ is, nor does it require the latent space to have fewer dimensions than the data space. Equation (1) only requires the covariance structure to persist when encoding along the first $K$ PCs in both the data and latent space. One key question is why we do not just set $V = U$. This is because we need the flexibility of encoding data into an arbitrary $V$ during UIT while keeping the general shape of the data distribution, as explained later.
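A form of Eq. (1) consistent with the description above can be written as follows. This is a reconstruction under stated assumptions rather than a verbatim statement: $U \in \mathbb{R}^{D \times K}$ and $V \in \mathbb{R}^{P \times K}$ are assumed to hold the first $K$ PCs as columns, $D$ is the (assumed) dimensionality of a vectorized data sample, and both $x$ and $z_x$ are assumed to be mean-centered.

```latex
\begin{equation}
x' = d(z_x), \qquad z_x = e(x), \qquad \text{s.t.} \quad U^{\top} x = V^{\top} z_x
\tag{1}
\end{equation}
```

Under this reading, setting $V = U$ with a linear $e$ and $d$ would recover ordinary PCA, which is why the constraint can be seen as a generalization of it.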
Equation (1) is general but difficult to optimize because it needs to be computed over the whole dataset, with high memory requirements, and $V$ is unknown. Therefore, we propose a local scheme, Eq. (2), which keeps the global alignment by enforcing local alignments on samples. In Eq. (2), $\cos(\cdot)$ is the cosine distance and $N$ is the batch size. $L_{\mathrm{angle}}$ and $L_{\mathrm{norm}}$ are computed on vectorized data samples and corresponding latent vectors respectively. The goal of $L_{\mathrm{localAlign}}$ is that, given any two data samples, their length ratio and angle should remain the same after projection into the latent space. $L_{\mathrm{angle}}$ aims to keep the overall shape of the distribution when the data are projected into the latent space, while $L_{\mathrm{norm}}$ allows scaling but prefers uniform scaling. $L_{\mathrm{localAlign}}$ has the effect of preserving the covariance structure of the data during transformation. While we only apply local alignment constraints between a certain number of pairs in each batch, we randomly sample various batches in each training epoch, constraining further distinct pairs in the process. Furthermore, the locality of this approach, which considers only two samples at a time, essentially ensures that the covariance structure is invariant under homogeneous transformations of the basis in the latent space, which allows the autoencoder to choose the optimal $V$ that aids reconstruction while preserving the covariance structure.
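The pairwise idea behind the local scheme can be illustrated with a short NumPy sketch. It is not the authors' Eq. (2): the function name `local_align_loss`, the use of all unordered pairs in the batch, and the squared-difference penalties are assumptions made purely for illustration. The sketch compares pairwise cosines and pairwise length ratios before and after encoding.

```python
import numpy as np


def local_align_loss(x, z, eps=1e-8):
    """Hypothetical local alignment penalty over all pairs in a batch.

    x: (N, D) vectorized data samples.
    z: (N, P) corresponding latent vectors.
    Returns (l_angle, l_norm); both are zero when pairwise angles and
    pairwise length ratios are unchanged by the encoding.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    nx = np.linalg.norm(x, axis=1) + eps
    nz = np.linalg.norm(z, axis=1) + eps

    # Pairwise cosines in data space and in latent space.
    cos_x = (x @ x.T) / np.outer(nx, nx)
    cos_z = (z @ z.T) / np.outer(nz, nz)

    # Pairwise length ratios ||a_i|| / ||a_j||.
    ratio_x = nx[:, None] / nx[None, :]
    ratio_z = nz[:, None] / nz[None, :]

    # Use each unordered pair (i < j) once.
    iu = np.triu_indices(len(x), k=1)
    l_angle = float(np.mean((cos_x[iu] - cos_z[iu]) ** 2))
    l_norm = float(np.mean((ratio_x[iu] - ratio_z[iu]) ** 2))
    return l_angle, l_norm
```

In this sketch, both terms vanish when the latent vectors are a rotated and uniformly scaled copy of the data, which matches the invariance described above, whereas a non-uniform scaling of the axes is penalized.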

Semantics-based manifold alignment
Given $z_A$ and $z_B$, we have two latent distributions with their respective covariance structures characterized by their latent-space PCs $V_A \in \mathbb{R}^{P \times P}$ and $V_B \in \mathbb{R}^{P \times P}$. To ensure that the UIT maintains the distributional semantics, we need to align $z_A$ and $z_B$, e.g., by aligning the direction in which $z_A$ shows the biggest variation with that of $z_B$, i.e., by aligning their first PCs. Further, to maintain visual semantics in multiple dimensions, we should align the top $K$ PCs in $V_A$ and $V_B$. Several alternative methods are possible, such as aligning $V_A$ and $V_B$ directly, or fixing one and aligning the other to it. Since both $V_A$ and $V_B$ are unknown, directly aligning $V_A$ and $V_B$ corresponds to minimizing Eq. (3), where $N_A$ and $N_B$ are the numbers of images in $A$ and $B$ respectively, and $V_A^k$ and $V_B^k$ are the $k$th PCs of $V_A$ and $V_B$ respectively. Experimentally, we have found that allowing both $V_A$ and $V_B$ to change causes the optimization to settle into local minima, so we fix $V_A$ and align $V_B$ with it. This also means that $e_A$ can be pre-trained and we can compute $V_A$ from $z_A$ using PCA.
Aligning $V_B$ to $V_A$ still presents challenges because directly learning $V_B$ requires simultaneously transforming all $z_y$, which is again equivalent to operating on the whole of $B$. This is a similar difficulty to the one in Section 3.2. Again, we operate only on a batch of $N$ samples to ensure the global alignment of $V_A$ and $V_B$, as given by Eq. (4), where $z_{y_j} \in z_B$ and $V_{A[0:K-1]} \in \mathbb{R}^{P \times K}$ contains the first $K$ PCs of interest of $z_A$. $L_{\mathrm{maniAlign}}$ requires each $z_y$ to be reconstructable after projection into the basis of $z_A$, which essentially encourages the covariance structure of $z_B$ to be similar to that of $z_A$, and the two bases to be aligned. The term $k \exp(-\alpha \|z_{y_j}\|_2^2)$ prevents $z_{y_j}$ from shrinking, which would otherwise lead to a trivial solution of Eq. (4). Since the covariance structures of $A$ and $B$ are kept in $z_A$ and $z_B$ via Eq. (2), $L_{\mathrm{maniAlign}}$ completes the semantics-based alignment of the two domains.
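The batch-wise alignment described above can be sketched in NumPy as a projection residual plus the anti-shrinking term. This is one plausible reading and not the authors' exact Eq. (4): the function name `mani_align_loss`, the squared residual, and the averaging over the batch are assumptions, as is treating the latent vectors as already centered and the columns of `V_A` as orthonormal. The defaults `k=15` and `alpha=1e-6` are the values reported in the Appendix.

```python
import numpy as np


def mani_align_loss(z_y, V_A, K, k=15.0, alpha=1e-6):
    """Hypothetical manifold alignment loss for one batch.

    z_y: (N, P) latent vectors of samples from B (assumed centered).
    V_A: (P, P) PCs of z_A stored as columns (assumed orthonormal).
    K:   number of leading PCs to align.
    """
    z_y = np.asarray(z_y, dtype=float)
    V_K = np.asarray(V_A, dtype=float)[:, :K]  # (P, K)

    # Project onto the first K PCs of z_A, then map back to latent space.
    z_proj = z_y @ V_K @ V_K.T
    recon = np.sum((z_y - z_proj) ** 2, axis=1)

    # Penalty that grows as a latent vector shrinks towards zero,
    # discouraging the trivial solution z_y = 0.
    shrink = k * np.exp(-alpha * np.sum(z_y ** 2, axis=1))
    return float(np.mean(recon + shrink))
```

A latent vector lying in the span of the first $K$ PCs incurs only the anti-shrinking term, while any component outside that span adds its squared length to the loss.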

Simultaneous decoding and translation
After semantics-based manifold alignment, we transform $z_A$ and $z_B$ into the space of $B$ to finish the UIT. Since $z_A$ and $z_B$ are aligned, we combine the reconstruction and translation tasks using a single network $g$ which serves both as a decoder and a translator: the general shapes of the two latent distributions are similar after alignment. When the decoder has been trained with the reconstruction loss on $z_B$, it has already to some extent learnt to translate $z_A$. We reconstruct $B$ using $y_{\mathrm{recon}} = g(z_y)$. Meanwhile, we treat $g$ as a generator in a generative adversarial network using $y_{\mathrm{trans}} = g(z_x)$ and use a discriminator network $h(y_{\mathrm{trans}}, y) \in [0, 1]$ to further improve the translation.
Overall, given a pre-trained $e_A$ and $z_A$, and hence also $V_A$, we minimize the objective function in Eq. (5), where $L_g$ and $L_d$ are the GAN losses, which depend on the chosen GAN model, $N_B$ is the total number of images in $B$, and $\omega_i$ are weights. In $L^B_{\mathrm{localAlign}}$, we apply Eq. (2) to both the $y$-to-$z_y$ and $z_x$-to-$y_{\mathrm{trans}}$ mappings. Further details are in the Appendix.

Implementation details
We pre-train an autoencoder for dataset $A$ with $L_{\mathrm{localAlign}}$ to get $e_A$ and calculate $V_A$ from $z_A$ using PCA. For $e_B$ and $g$, we adopt the network architectures from UNIT. In all experiments, we set $\omega_1 = 0.033$, $\omega_2 = \omega_3 = 0.333$, and other weights according to the experiment. See the Appendix for details. $e_A$ is trained for 100 epochs and the remaining networks are trained for 300 epochs on all datasets, using Adam [33] with a batch size of 16, a learning rate of 0.0001, and exponential decay rates $(\beta_1, \beta_2) = (0.5, 0.999)$. All experiments were conducted using two NVIDIA GTX 1080 Ti GPUs and PyTorch. Training took 6-18 hours.

Data
We employed several benchmark datasets to validate our method, including SummerWinter [2], CatDog [24], and ShoeHandbag [34,35]. We also built a very challenging dataset, MNISTHandbag, using hand-drawn digits from MNIST [36] and handbags randomly sampled from Ref. [35]; it has two distinctive distributions and distributional semantics. All comparative results in this section were computed using LSGAN. See the Appendix for details and further results.

Evaluation metrics
In addition to visual evaluation, we employ the Fréchet inception distance (FID) [37] as a quantitative measure. However, there is no good metric to measure the faithfulness of the preservation of distributional semantics. We therefore propose a new evaluation metric called the ordering-tolerance curve (OTC). Given images $x \in A$ and their translations $y_{\mathrm{trans}} \in B$, we define the OTC as a curve $c(\beta)$, where $\beta \in [0, 1]$, $\mathrm{card}(\cdot)$ is the cardinality of a set, $d(x, y_{\mathrm{trans}}) = |\mathrm{rank}(y_{\mathrm{trans}}) - \mathrm{rank}(x)|$, $\mathrm{rank}(x)$ is the rank of $x$ in $A$ along a chosen PC among all data samples in $A$, $N_A$ is the total number of data samples in $A$, and $\mathrm{rank}(y_{\mathrm{trans}})$ is the rank of $y_{\mathrm{trans}}$ along a chosen PC in $B$. $d(x, y_{\mathrm{trans}})$ is equal to zero if the rank of $x$ is retained during translation, and non-zero otherwise (the larger the worse). $c$ is the percentage of correctly ordered $x$ values whose normalized rank errors are within $\beta$, the ordering error tolerance.
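The OTC can be computed along the following lines. This NumPy sketch assumes that the curve is $c(\beta) = \mathrm{card}(\{x : d(x, y_{\mathrm{trans}})/N_A \le \beta\}) / N_A$, that the translated images are ranked among themselves (so both rankings have $N_A$ entries), and that the inputs are the scalar projections of each image onto the chosen PC. The function names `ranks` and `otc` are hypothetical.

```python
import numpy as np


def ranks(scores):
    """Rank of each entry (0 = smallest); ties are broken by position."""
    order = np.argsort(scores, kind="stable")
    r = np.empty(len(order), dtype=int)
    r[order] = np.arange(len(order))
    return r


def otc(proj_x, proj_y_trans, betas):
    """Ordering-tolerance curve under the assumptions stated above.

    proj_x:       projections of the inputs x onto the chosen PC of A.
    proj_y_trans: projections of the translations onto the chosen PC of B,
                  listed in the same order as proj_x.
    betas:        tolerances in [0, 1] at which to evaluate the curve.
    """
    rx = ranks(np.asarray(proj_x))
    ry = ranks(np.asarray(proj_y_trans))
    err = np.abs(ry - rx) / len(rx)  # normalized rank error per sample
    return np.array([np.mean(err <= b) for b in betas])
```

For example, with four samples where the two middle translations swap places, half of the samples keep their exact rank and all of them are within a tolerance of 0.25. Because the sign of a PC is arbitrary, one of the two projections may need to be negated before ranking.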

Semantics-preserving transformation
A key contribution of our work is a straightforward but extremely effective transformation scheme that preserves distributional semantics. To show the necessity of such transformations in UIT, we trained an autoencoder on cat images in CatDog, then computed the first PCs of $A$ and $z_A$. The autoencoder is based on the encoder and decoder of UNIT. Please see the Appendix for further details. We then ranked all images along the two PCs: see Fig. 2.
In the original data, the variation on the 1st PC is mainly a color transition from light to dark (see Fig. 2(top)). However, without semantics-preserving transformations, the shape of the latent distribution is changed, and unable to preserve the visual semantics ( Fig. 2(bottom)). In contrast, DSM preserves the distributional semantics ( Fig. 2(middle)). We also show the OTCs in Fig. 3(left): DSM can contain the rank error to under 3% while a standard autoencoder fails systematically. Although we only show the results from a specific autoencoder, our preliminary experiments showed this to be a common problem of autoencoders.
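The ranking used in this experiment can be sketched as follows. This is a minimal NumPy illustration, assuming each image (or latent vector) has been flattened into one row of a matrix; the function names `principal_components` and `order_along_pc` are hypothetical and not taken from the authors' code.

```python
import numpy as np


def principal_components(data, k):
    """Return the first k PCs (as columns) and their variances.

    data: (N, D) matrix with one flattened sample per row.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are unit-length PCs, sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2 / (len(data) - 1)
    return vt[:k].T, var[:k]


def order_along_pc(data, pc):
    """Indices of the samples sorted by their projection onto one PC."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    return np.argsort(centered @ pc)
```

Applying the same two functions to the images and to their latent vectors gives the two orderings that are compared. The sign of a PC is arbitrary, so an ordering may come out reversed and should be flipped consistently before comparison.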

Architectural studies
Two main adjustable components of DSM are the GAN architecture and the number of PCs, $K$. We first evaluated DSM on the first $K$ PCs on all datasets; we present the FID scores in Table 1. While we expected that a larger $K$ would result in a harder optimization problem and hence lower quality, the results show that it depends on the dataset. After considering the data, this is understandable because different datasets have different variance distributions over the PCs: some have large variance on the first PC while others have variances spread over the first $K$ PCs, which affects the behavior of Eq. (4). Although control can be added, e.g., by adding weights to different PCs, we choose not to do so and let DSM adapt to the data. We only tested DSM up to $K = 3$: while DSM can work for $K > 3$, it is typically hard for humans to visually understand the semantics in PCs where $K > 3$. We also compared two GAN architectures, LSGAN and NSGAN, on ShoeHandbag and show the OTCs in Fig. 3(middle). Both keep the rank error to within around 20%, showing that alignment constraints can visibly improve translation ability for differing GAN architectures. Further results can be found in the Appendix.

Image quality in translation
Although normally the first $K$ PCs bear visually understandable semantics, the specific value of $K$ depends on the dataset. The first PC is almost always visually interpretable across a dataset, and some datasets have meaningful variations on other PCs. We show the images from the four datasets ranked along their respective meaningful PCs in Fig. 4. For CatDog, SummerWinter, and ShoeHandbag, the first PC (PC0) shows color variations from light to dark; for MNIST, shape varies from slim to round; and for ShoeHandbag, the second PC (PC1) shows collar height variation for shoes and handle length variation for handbags.
In Fig. 5, we show our results for mappings of cat-to-dog, summer-to-winter, shoes-to-handbags, and digits-to-handbags. PC0s of CatDog and SummerWinter are color variations (light to dark). The major difference is that the color variations in CatDog are clearly separated into foreground (faces) and background while there is no such separation in SummerWinter. In both cases, DSM successfully translates the images with high quality while simultaneously matching the semantics. In cat-to-dog, DSM transforms the faces and maintains the separation of the foreground and background, while in summer-to-winter, the changes are more heterogeneous depending on the scenes; DSM lays snow on different landscapes. In shoes-to-handbags, unlike CatDog and SummerWinter, the color variation is restricted to the object itself, and DSM faithfully keeps the semantics. Finally, to push DSM further, we tested a digit-to-handbag translation. These two datasets have distinctive distributions. The results show that long slim digits are translated to handbags in light colors while fat round digits are translated to handbags in dark colors, which is consistent with their respective PC0s in Fig. 4. We also show the OTCs in Fig. 3(right).

Comparisons
We compare our model to UNIT, CycleGAN, DistanceGAN, and DRIT++ on the CatDog and ShoeHandbag datasets, using the public code shared by the authors of these methods. We first give FID scores for all methods in Table 2 and OTCs in Fig. 6. By aligning two data manifolds based on their semantics, DSM is able to improve the translation quality for both datasets. The OTCs show clearly that DSM can keep the semantics on PC0 better than the other methods by containing the rank error to within around 30% and 18% (see Fig. 6). The second-best methods contain it to roughly only 50% and 40%. On PC1 of ShoeHandbag (see Fig. 6(right)), CycleGAN is close to DSM. However, we argue that its behavior is inconsistent: see Figs. 6(left) and 6(middle) where the variance is large and contains large amounts of semantic information. Visual comparisons can be found in Fig. 7. Overall, DSM generates images of higher visual quality. Additionally, other methods are incapable of preserving or matching the distributional semantics. In Fig. 7(left), the major variation of the input images from left to right is a color variation, light-to-dark. While DSM obviously keeps the same variations during UIT, it is hard to find similar effects in other methods. Similar observations can also be made in Fig. 7(middle). To further explore semantics in other PCs, we show Fig. 7(right). While the input varies from slippers to sneakers, our results generate handbags varying from those without handles to those with handles. In contrast, other methods struggle to generate consistent semantic variation. Further results can be found in the Appendix.

Discussions and conclusions
While PCA is a straightforward way of characterizing distributional semantics, our method can incorporate alternative techniques such as kernel PCA, multidimensional scaling, and others that can define the covariance structure by "flattening" the data manifold before performing DSM.
We have only evaluated DSM up to K = 3 because the semantics start to lack visual meaning for K > 3. However, we argue that DSM is effective and useful for two reasons. Firstly, for almost all datasets, the first PC bears the majority of the variance, and the distributional semantics captured by the variance are always visually interpretable. Secondly, aligning the PCs of the two distributions during translation increases image quality.
In summary, we have proposed DSM, the first UIT method which preserves and matches the distributional semantics of two image domains. It is straightforward and effective, as demonstrated on multiple datasets, and capable of improving translation quality compared to the state-of-the-art. DSM is also general in its capacity to incorporate any GAN and autoencoder model. In future, we will incorporate human-labelling in a semi-supervised setting of DSM where humans can arbitrarily decide the semantics by ranking images. This will enable DSM to encode arbitrary semantics and open it up to many other applications.

Equation (5) contains a local alignment loss $L^B_{\mathrm{localAlign}}$, which applies Eq. (2) to both $y$-to-$z_y$ and $z_x$-to-$y_{\mathrm{trans}}$. Applying Eq. (2) to $y$-to-$z_y$ is straightforward, and ensures a semantics-preserving transformation between $B$ and $z_B$, just as the encoder of $A$ does for $A$. Here we explain why applying Eq. (2) to $z_x$-to-$y_{\mathrm{trans}}$ is essential. As Section 3.4 explains, we align $z_A$ and $z_B$, and use a single network $g$ to serve as both the decoder for $z_B$ and the generator for $z_A$. This mechanism provides an implicit constraint on the translated images $\{y_{\mathrm{trans}}\}$ from $z_A$ so that they have a similar distribution to the reconstructed images $\{y_{\mathrm{recon}}\}$ from $z_B$. However, as there is an additional GAN loss which modifies the distribution of the translated images, the alignment of $\{y_{\mathrm{trans}}\}$ and $\{y_{\mathrm{recon}}\}$ may be affected, compromising the distributional semantics matching. Thus, we also apply Eq. (2) to $z_x$-to-$y_{\mathrm{trans}}$ to explicitly preserve the semantics, with $L^e_{\mathrm{localAlign}}$ and $L^g_{\mathrm{localAlign}}$ denoting the loss terms for $y$-to-$z_y$ and $z_x$-to-$y_{\mathrm{trans}}$ respectively. All experiments set $\omega_1 = 0.033$, $\omega_2 = \omega_3 = 0.333$, and $\omega^g_4 = 0.167$ in Eq. (5) of the paper. We set $\omega^e_4 = 0.066$ in the ShoeHandbag experiments that align 3 PCs and $\omega^e_4 = 0.05$ in all other experiments. We set $k = 15$ and $\alpha = 10^{-6}$ for the regularization term in Eq. (4).

B.1 Data details
For all datasets, images were resized to 256 × 256. In CatDog, 871 cat (birman) and 1364 dog (husky, samoyed) images were randomly divided into 771 (cat) and 1264 (dog) for training and the remainder used for testing. SummerWinter comprises 1540 summer photos and 1200 winter photos, which were randomly divided into 1231 (summer) and 962 (winter) for training and the remainder used for testing. For ShoeHandbag, we randomly sampled images from edges2shoes and edges2handbags, using 3726 (shoe) and 3822 (handbag) for training, and 101 (shoe) and 178 (handbag) for testing. For MNISTHandbag, 1600 MNIST images and 1600 handbag images were randomly selected from MNIST and edges2handbags, with 1500 of each for training and 100 for testing. We show images from the ShoeHandbag dataset along the first 3 PCs in Fig. 8.

B.2 Our OTCs
Figure 10 shows OTCs for our method on the CatDog, MNISTHandbag, and SummerWinter datasets, with up to 3 PCs aligned. We can see that our method always keeps semantics best for PC0, and the rank errors become larger on PC1 and PC2. This is mainly because the variance ratio along PC0 is always much larger than along other PCs. For example, in the Handbag dataset, the ratios of variances along the first 3 PCs are about 4:2:1. As a result, the network preferentially aligns along PC0 in order to minimize the total loss. We also note that semantics preservation varies considerably across different datasets. In the challenging MNISTHandbag dataset, which has two distinctive distributions, our method preserves semantics well along PC0, while in the SummerWinter dataset, our method is capable of preserving semantics on the first 3 PCs. Although control could be used, e.g., via weights to enforce the alignment of multiple PCs, we choose not to do so and make DSM adapt to the data, as the distribution of variances on different PCs is an intrinsic property of the data itself which should be respected during translation.
We evaluate semantics preservation only for the first 3 PCs for two main reasons. Firstly, in most image datasets, compared to the variances on the first 3 PCs, the variances on the remaining PCs are very small. For example, in the Summer dataset, even the sum of variances on the 4th-20th PCs is smaller than the variance of PC0. Enforcing alignment along directions with very small variations adds complexity to the optimization while decreasing the explicability of the mapping. Secondly, by investigating various popular datasets (e.g., Shoes, Handbags, Cars, Animals, Faces, Art works), we discovered that while people can easily perceive semantics on the first PC in all datasets, they can only do so on the second PC in the Shoe, Handbags, and MNIST datasets. People cannot perceive any semantics on the 4th and subsequent PCs. Hence we focus on the first 3 PCs.
Figure 9 and Table 5 compare using LSGAN and NSGAN for image translation on the ShoeHandbag dataset when aligning the first 1-3 PCs. The OTCs show that the two GAN models result in very similar semantics preservation in all cases, demonstrating that our method does not rely on a specific GAN model and can preserve semantics using different GAN models. We also note that the FIDs of NSGAN are higher than those of LSGAN. This is mainly because the image generation capability of NSGAN is weaker than that of LSGAN. Employing other GAN models such as StyleGAN [9] could improve the FID scores.

B.4 Alignment of PCs
Equation (4) in the paper requires the first $K$ PCs of two domains to be aligned. However, it does not specify the order of alignment. In other words, it does not specify whether PC1 of the first domain should be aligned with PC1 of the second domain, etc. We have two choices. The first is to enforce order: PC0-to-PC0, PC1-to-PC1, and so on. However, we find this to be sub-optimal in the sense that Eq. (4) is affected by the distribution of variances across PCs. For a dataset with a majority of its variance on PC0, the alignment forces due to Eq. (4) have little effect on PC1 and subsequent PCs. We argue that they should be small because the dataset's variance on PC1 and higher PCs is less explicable. Hence, the force provided by Eq. (4) should naturally follow the variance distribution. We therefore take the second choice and do not enforce the order of alignment of PCs for two datasets. As a result, while PC0 is always mapped to PC0 in all experiments, sometimes PC1 of one dataset can be mapped to PC2 of another. We argue that such a mapping is still valid because it is explicable and reflects the distributional semantics.