TriGAN: Image-to-Image Translation for Multi-Source Domain Adaptation

Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method is inspired by the observation that the appearance of a given image depends on three factors: the domain, the style (characterized in terms of low-level feature variations) and the content. For this reason we propose to project the image features onto a space where only the dependence on the content is kept, and then re-project this invariant representation onto the pixel space using the target domain and style. In this way, new labeled images can be generated, which are then used to train the final target classifier. We test our approach on common MSDA benchmarks, showing that it outperforms state-of-the-art methods.


Introduction
A well-known problem in computer vision is the need to adapt a classifier trained on a given source domain so that it works on a different, target domain. Since the two domains typically have different marginal feature distributions, the adaptation process needs to reduce the corresponding domain shift [45]. In many practical scenarios, the target data are not annotated and Unsupervised Domain Adaptation (UDA) methods are required.
While most previous adaptation approaches consider a single source domain, in real-world applications we may have access to multiple datasets. In this case, Multi-Source Domain Adaptation (MSDA) methods [52,31,51,36] may be adopted, in which more than one source dataset is used in order to make the adaptation process more robust. However, although more data can be used, MSDA is challenging, as multiple domain-shift problems need to be simultaneously and coherently solved.
In this paper we deal with (unsupervised) MSDA using a data-augmentation approach based on a Generative Adversarial Network (GAN) [13]. Specifically, we generate artificial target samples by "translating" images from all the source domains into target-like images. The synthetically generated images are then used to train the target classifier. While this strategy has been recently adopted in the single-source UDA scenario [40,17,27,34,41], we are the first to show how it can be effectively used in an MSDA setting. In more detail, our goal is to build and train a "universal" translator which can transform an image from an input domain to a target domain. The translator network is "universal" because the number of parameters which need to be optimized scales only linearly with the number of domains. We achieve this goal using domain-invariant intermediate features, computed by the encoder part of our generator, and then projecting these features onto the domain-specific target distribution using the decoder.
To make this image translation effective, we assume that the appearance of an image depends on three factors: the content, the domain and the style. The domain models properties that are shared by the elements of a dataset but which may not be shared by other datasets. On the other hand, the style factor represents properties which are shared among different local parts of a single image and describes low-level features specific to that image (e.g., the color or the texture). The content is what we want to keep unchanged during the translation process: typically, it is the foreground object shape, which is described by the image labels associated with the source data samples. Our encoder obtains the intermediate representations in a two-step process: we first generate style-invariant representations and then we compute the domain-invariant representations. Symmetrically, the decoder transforms the intermediate representations by first projecting these features onto a domain-specific distribution and then onto a style-specific distribution. In order to modify the underlying distribution of a set of features, inspired by [38], in the encoder we use whitening layers which progressively align the style-and-domain feature distributions. Then, in the decoder, we project the intermediate invariant representation onto a new domain-and-style specific distribution with Whitening and Coloring (WC) [42] batch transformations, according to the target data.
A "universal" translator similar in spirit to our proposed generator is StarGAN [5] (proposed in a non UDA task). However, in StarGAN the domain information is represented by a one-hot vector concatenated with the input image. When we use StarGAN in our MSDA scenario, the synthesized images are much less effective for training the target classifier, and this emiprically shows that our batchbased transformation of the image distribution is more effective for our translation task. Contributions. Our main contributions can be summarized as follows. (i) We propose the first generative MSDA method. We call our approach TriGAN because it is based on three different factors of the images: the style, the domain and the content. (ii) The proposed image translation process is based on style and domain specific statistics which are first removed from and then added to the source images by means of modified W C layers. Specifically, we use the following feature transformations (associated with a corresponding layer type): Instance Whitening Transform (IW T ), Domain Whitening Transform (DW T ) [38], conditional Domain Whitening Transform (cDW T ) and Adaptive Instance Whitening Transform (AdaIW T ). IW T and AdaIW T are novel layers introduced in this paper. (iii) We test our method on two MSDA datasets, Digits-Five [51] and Office-Caltech10 [12], outperforming state-of-theart methods.

Related Work
In this section we review previous work on UDA, considering both single-source and multi-source methods. Since the proposed generator is also related to deep models used for image-to-image translation, we analyse related work on this topic as well. Single-source UDA. Single-source UDA approaches assume a single labeled source domain and can be broadly classified into three main categories, depending on the strategy adopted to cope with the domain-shift problem. The first category uses first and second-order statistics to model the source and the target feature distributions. For instance, [28,29,50,48] minimize the Maximum Mean Discrepancy, i.e., the distance between the means of the feature distributions of the two domains. On the other hand, [44,33,37] achieve domain invariance by aligning the second-order statistics through correlation alignment. Differently, [3,25,30] reduce the domain shift with domain-alignment layers derived from batch normalization (BN) [20]. This idea has been recently extended in [38], where grouped-feature whitening (DWT) is used instead of the feature standardization of BN. In our proposed encoder we also use DWT layers, which we adapt to work in a generative network. In addition, we propose other style- and domain-dependent batch-based normalizations (i.e., IWT, cDWT and AdaIWT).
The second category of methods computes domain-agnostic representations by means of an adversarial learning-based approach. For instance, discriminative domain-invariant representations are constructed through a gradient reversal layer in [9]. Similarly, the approach in [46] uses a domain confusion loss to promote the alignment between the source and the target domain.
The third category of methods uses adversarial learning in a generative framework (i.e., GANs [13]) to reconstruct artificial source and/or target images and perform domain adaptation. Notable approaches are SBADA-GAN [40], CyCADA [17], CoGAN [27], I2I Adapt [34] and Generate To Adapt (GTA) [41]. While these generative methods have been shown to be very successful in UDA, none of them deals with a multi-source setting. Note that trivially extending these approaches to an MSDA scenario involves training N different generators, where N is the number of source domains. In contrast, in our universal translator only a subset of parameters grows linearly with the number of domains (Sec. 3.2.3), while the others are shared over all the domains. Moreover, since we train our generator using (N + 1)^2 translation directions, we can largely increase the number of training sample-domain pairs effectively used (Sec. 3.3). Multi-source UDA. In [52], multiple-source knowledge transfer is obtained by borrowing knowledge from the target's k nearest-neighbour sources. Similarly, a distribution-weighted combining rule is proposed in [31] to construct a target hypothesis as a weighted combination of source hypotheses. Recently, Deep Cocktail Network (DCTN) [51] uses the distribution-weighted combining rule in an adversarial setting. A Moment Matching Network (M3SDA) is introduced in [36] to reduce the discrepancy between the multiple source domains and the target domain. Differently from these methods, which operate in a discriminative setting, our method relies on a deep generative approach to MSDA. Image-to-image Translation. Image-to-image translation approaches, i.e., those methods which learn how to transform an image from one domain to another, possibly keeping its semantics, are the basis of our method. In [21] the pix2pix network translates images under the assumption that paired images in the two domains are available at training time.
In contrast, CycleGAN [53] can learn to translate images using unpaired training samples. Note that, by design, these methods work with two domains. ComboGAN [1] partially alleviates this issue by using N generators for translations among N domains. Our work is also related to StarGAN [5], which handles unpaired image translation amongst N domains (N ≥ 2) through a single generator. However, StarGAN achieves image translation without explicitly forcing the image representations to be domain invariant, and this may lead to a significant reduction of the network representation power as the number of domains increases. On the other hand, our goal is to obtain an explicit, intermediate image representation which is style- and domain-independent. We use IWT and DWT to achieve this. We also show that this invariant representation can simplify the re-projection process onto a desired style and target domain. This is achieved through AdaIWT and cDWT, which results in very realistic translations amongst domains. Very recently, a whitening and coloring based image-to-image translation method was proposed in [4], where the whitening operation is weight-based: the transformation is embedded into the network weights. Specifically, whitening is approximated by enforcing the covariance matrix, computed using the intermediate features, to be equal to the identity matrix. Conversely, our whitening transformation is data dependent (i.e., it depends on the specific batch statistics, Sec. 3.2.1) and uses the Cholesky decomposition [6] to compute the whitening matrices of the input samples in closed form, thereby eliminating the need for additional ad-hoc losses.

Figure 1 (caption excerpt): ... objects and skewered objects, respectively. The content is represented by the object's shape (square, circle or triangle). The style is represented by the color: each image input to G has a different color and each domain has its own set of styles. First, the encoder E creates a style-invariant representation using IWT blocks. DWT blocks are then used to obtain a domain-invariant representation. Symmetrically, the decoder D brings back domain-specific information with cDWT blocks (for simplicity we show only a single output domain, T). Finally, we apply a reference style. The reference style is extracted using the style path and it is applied using the Adaptive IWT blocks.

Style-and-Domain based Image Translation
In this section we describe the proposed approach for MSDA. We first provide an overview of our method and introduce the notation adopted throughout the paper (Sec. 3.1). Then we describe the TriGAN architecture (Sec. 3.2) and our training procedure (Sec. 3.3).

Notation and Overview
In the MSDA scenario we have access to N labeled source datasets {S_j}_{j=1}^N, where S_j = {(x_i, y_i)}_{i=1}^{n_j}, and to an unlabeled target dataset T = {x_k}_{k=1}^{n_t}. All the datasets (target included) share the same categories, and each of them is associated with a domain: D_{s_1}, ..., D_{s_N}, D_t, respectively. Our final goal is to build a classifier for the target domain D_t exploiting the data in {S_j}_{j=1}^N ∪ T. Our method is based on two separate training stages. We initially train a generator G which learns how to change the appearance of a real input image in order to adhere to a desired domain and style. Importantly, our G learns mappings between every possible pair of image domains. Learning (N + 1)^2 translation directions makes it possible to exploit much more supervisory information with respect to a plain strategy in which N different source-to-target generators are trained (Sec. 3.3). Once G is trained, in the second stage we use it to generate target data with the same content as the source data, thus creating a new, labeled, target dataset, which is finally used to train a target classifier C. Note that, when training G (first stage), we do not use class labels, and T is treated in the same way as the other datasets.
As mentioned in Sec. 1, G is composed of an encoder E and a decoder D (Fig. 1). The role of E is to "whiten", i.e., to remove, both domain-specific and style-specific aspects of the input image features in order to obtain domain- and style-invariant representations. Symmetrically, D "colors" the domain- and style-invariant features generated by E, progressively projecting these intermediate representations onto a domain- and style-specific space. In more detail, for a given x_i in a batch B, the task of G is to transform x_i into x̂_i such that: (1) x_i and x̂_i share the same content, and (2) x̂_i belongs to the domain D_{l_i^O} and has the same style as the reference image x_i^O.

TriGAN Architecture
The TriGAN architecture is composed of a generator network G and a discriminator network D_P. As mentioned above, G comprises an encoder E and a decoder D, which we describe in Secs. 3.2.2-3.2.3. The discriminator D_P is based on the Projection Discriminator [32]. Before describing the details of G, we briefly review the WC transform [42] (Sec. 3.2.1), which is used as the basic operation in our proposed batch-based feature transformations.

Preliminaries: Whitening & Coloring Transform
Let F(x) ∈ R^{h×w×d} be the tensor representing the activation values of the convolutional feature maps in a given layer for the input image x, with d channels and h×w spatial locations. We treat each spatial location as a d-dimensional vector: each image x_i is thus associated with a set X_i of h×w vectors, and a batch of m images yields the set B = X_1 ∪ ... ∪ X_m, with |B| = h×w×m elements. The WC transform is a multivariate extension of the per-dimension normalization and shift-scaling transform (BN) proposed in [20] and widely adopted in both generative and discriminative networks. WC can be described by:

WC(v_j; β, Γ) = Coloring(Whitening(v_j); β, Γ), where:

Coloring(v̂_j; β, Γ) = Γ v̂_j + β,  (1)
Whitening(v_j) = W_B (v_j - μ_B).  (2)

In Eq. 2, μ_B is the centroid of the elements in B, while W_B is such that W_B^T W_B = Σ_B^{-1}, Σ_B being the covariance matrix of the elements in B. The result of applying Eq. 2 to the elements of B is a set of whitened features B̂ = {v̂_1, ..., v̂_{h×w×m}}, which lie in a spherical distribution (i.e., with a covariance matrix equal to the identity matrix). On the other hand, Eq. 1 performs a coloring transform, i.e., it projects the elements of B̂ onto a learned multivariate Gaussian distribution. While μ_B and W_B are computed using the elements in B (they are data-dependent), Eq. 1 depends on the d-dimensional learned parameter vector β and the d×d-dimensional learned parameter matrix Γ. Eq. 1 is a linear operation and can be simply implemented using a convolutional layer with kernel size 1 × 1.
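To make the two steps concrete, the whitening and coloring operations can be sketched in NumPy as follows. This is a minimal, batch-level illustration written by analogy with Sec. 3.2.1; the function names and the eps regularizer are ours, not from the paper:

```python
import numpy as np

def whiten(V, eps=1e-5):
    """Whitening (Eq. 2): map a set of d-dimensional feature vectors
    V (shape (n, d)) to zero mean and (approximately) identity covariance.
    W_B is obtained in closed form from the Cholesky factor of the batch
    covariance, so that W_B^T W_B = Sigma_B^{-1}."""
    mu = V.mean(axis=0)
    Vc = V - mu
    cov = Vc.T @ Vc / V.shape[0] + eps * np.eye(V.shape[1])
    W = np.linalg.inv(np.linalg.cholesky(cov))  # cov = L L^T, W = L^{-1}
    return Vc @ W.T, mu, W

def color(V_hat, Gamma, beta):
    """Coloring (Eq. 1): project whitened features onto a learned
    multivariate Gaussian; equivalent to a 1x1 convolution with
    weights Gamma and bias beta."""
    return V_hat @ Gamma.T + beta
```

In a network layer, `Gamma` and `beta` would be learned parameters, while `mu` and `W` are recomputed from the current batch.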
In this paper we use the WC transform in our encoder E and decoder D, in order to first obtain a style-and-domain invariant representation for each x i ∈ B, and then transform this representation accordingly to the desired output domain D l O i and style image sample x O i . The next subsections show the details of the proposed architecture.

Encoder
The encoder E is composed of a sequence of standard Convolution_{k×k} - Normalization - ReLU - AveragePooling blocks and some ResBlocks (more details in the Supplementary Material), in which we replace the common BN layers [20] with our proposed normalization modules, detailed below.
Obtaining Style Invariant Representations. In the first two blocks of E we whiten the first and second-order statistics of the low-level features of each X_i ⊆ B, which are mainly responsible for the style of an image [11]. To do so, we propose the Instance Whitening Transform (IWT), where the term instance is inspired by Instance Normalization (IN) [49] and highlights that the proposed transform is applied to the set of features of a single image:

IWT(v_j) = Coloring(Whitening(v_j; μ_{X_i}, Σ_{X_i}); β, Γ), for v_j ∈ X_i.  (3)

Note that in Eq. 3 we use X_i as the batch, where X_i contains only the features of the specific image x_i (Sec. 3.2.1). Moreover, each v_j ∈ X_i is extracted from the first two convolutional layers of E, thus v_j has a small receptive field. This implies that whitening is performed using an image-specific feature centroid μ_{X_i} and covariance matrix Σ_{X_i}, which represent the first and second-order statistics of the low-level features of x_i. On the other hand, coloring is based on the parameters β and Γ, which do not depend on x_i or l_i. The coloring operation is analogous to the per-dimension shift-scaling transform computed in BN just after feature standardization [20] and is necessary to avoid decreasing the network representation capacity [42].
Obtaining Domain Invariant Representations. In the subsequent blocks of E we whiten first and second-order statistics which are domain specific. For this operation we adopt the Domain Whitening Transform (DWT) proposed in [38]. Specifically, for each X_i ⊆ B, let l_i be its domain label (see Sec. 3.1) and let B_{l_i} ⊆ B be the subset of features extracted from all the images in B which share the same domain label. Then, for each v_j ∈ B_{l_i}:

DWT(v_j) = Coloring(Whitening(v_j; μ_{B_{l_i}}, Σ_{B_{l_i}}); β, Γ).  (4)

Similarly to Eq. 3, Eq. 4 performs whitening using a subset of the current feature batch. Specifically, all the features in B are partitioned depending on the domain label of the image they have been extracted from, obtaining B_1, B_2, ..., where all the features in B_l belong to the images of domain D_l. Then, B_l is used to compute domain-dependent first and second-order statistics (μ_{B_l}, Σ_{B_l}). These statistics are used to project each v_j ∈ B_l onto a domain-invariant spherical distribution. A similar idea was recently proposed in [38] in a discriminative network for single-source UDA. However, differently from [38], we also use coloring, re-projecting the whitened features onto a new space governed by a learned multivariate distribution. This is done using the (layer-specific) parameters β and Γ, which do not depend on l_i.
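The domain-wise grouping performed by DWT can be sketched in NumPy as follows. This is illustrative only: in the real network the statistics come from convolutional feature maps, and Gamma and beta are learned parameters:

```python
import numpy as np

def dwt(V, domains, Gamma, beta, eps=1e-5):
    """Domain Whitening Transform sketch (Eq. 4): each feature vector is
    whitened with the statistics of its own domain's subset B_l, then all
    vectors are colored with the shared parameters (Gamma, beta)."""
    out = np.empty_like(V)
    for l in np.unique(domains):
        idx = domains == l
        Vl = V[idx]
        mu = Vl.mean(axis=0)
        Vc = Vl - mu
        cov = Vc.T @ Vc / Vl.shape[0] + eps * np.eye(V.shape[1])
        W = np.linalg.inv(np.linalg.cholesky(cov))  # domain-specific W_{B_l}
        out[idx] = (Vc @ W.T) @ Gamma.T + beta      # shared coloring
    return out
```

The decoder's cDWT layer (Sec. 3.2.3) follows the same grouping pattern, except that features are partitioned by the desired output domain and each group is colored with its own domain-specific parameters (β_l, Γ_l).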

Decoder
Our decoder D is functionally and structurally symmetric with respect to E: it takes as input the domain- and style-invariant features computed by E and projects these features onto the desired domain D_{l_i^O} with the style extracted from the reference image x_i^O. Similarly to E, D is a sequence of ResBlocks and a few Upsampling - Normalization - ReLU - Convolution_{k×k} blocks (more details in the Supplementary Material).
Similarly to Sec. 3.2.2, in the N ormalization layers we replace BN with our proposed feature normalization approaches, which are detailed below.
Projecting Features onto a Domain-specific Distribution. Apart from the last two blocks of D (see below), all the other blocks are dedicated to projecting the current set of features onto a domain-specific subspace. This subspace is learned from data using domain-specific coloring parameters (β_l, Γ_l), where l is the label of the corresponding domain. To this purpose we introduce the conditional Domain Whitening Transform (cDWT), where the term "conditional" specifies that the coloring step is conditioned on the domain label l. In more detail, similarly to Eq. 4, we first partition B into B_1, B_2, .... However, the membership of v_j ∈ B to B_l is decided taking into account the desired output domain label l_i^O of each image, rather than its original domain as in the case of Eq. 4: if v_j ∈ X_i and the output domain label of x_i is l_i^O, then v_j is assigned to B_{l_i^O}. Once B has been partitioned, we define cDWT as follows:

cDWT(v_j) = Coloring(Whitening(v_j; μ_{B_l}, Σ_{B_l}); β_l, Γ_l), for v_j ∈ B_l.  (5)

Note that, after whitening, and differently from Eq. 4, coloring in Eq. 5 is performed using the domain-specific parameters (β_l, Γ_l).
Applying a Specific Style. In order to apply a specific style to x_i, we first extract the output style from the reference image x_i^O associated with x_i (Sec. 3.1). This is done using the Style Path (see Fig. 1), which consists of two Convolution_{k×k} - IWT - ReLU - AveragePooling blocks (which share their parameters with the first two layers of the encoder) and a MultiLayer Perceptron (MLP) F. Following [11], we represent a style using the first and second-order statistics (μ_{X_i^O}, W_{X_i^O}), which are extracted using the IWT blocks (Sec. 3.2.2). Then we use F to adapt these statistics to the domain-specific representation obtained as the output of the previous step. In fact, in principle, for each v_j ∈ X_i^O, the Whitening() operation inside the IWT transform could be "inverted" using:

Coloring(v̂_j; μ_{X_i^O}, W_{X_i^O}^{-1}) = W_{X_i^O}^{-1} v̂_j + μ_{X_i^O}.  (6)

Indeed, the coloring operation (Eq. 1) is the inverse of whitening (Eq. 2). However, the elements of X_i now lie in a feature space different from the output space of Eq. 3, thus the transformation defined by the Style Path needs to be adapted. For this reason, we use an MLP (F) which implements this adaptation:

(β̂_i^O, Γ̂_i^O) = F(μ_{X_i^O}, W_{X_i^O}^{-1}).  (7)

Once (β̂_i^O, Γ̂_i^O) have been generated, we use them as the coloring parameters of our Adaptive IWT (AdaIWT):

AdaIWT(v_j) = Coloring(Whitening(v_j; μ_{X_i}, Σ_{X_i}); β̂_i^O, Γ̂_i^O), for v_j ∈ X_i.  (8)

Eq. 8 imposes style-specific first and second-order statistics on the features of the last blocks of D in order to mimic the style of x_i^O.
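The Style Path adaptation can be sketched as a toy NumPy version, where a two-layer MLP stands in for F of Eq. 7. The weight names, hidden size and the flattened style vector are illustrative assumptions; in the real model all weights are learned:

```python
import numpy as np

def style_mlp(style_vec, W1, b1, W2, b2, d):
    """Sketch of F (Eq. 7): map the style statistics of the reference
    image (flattened into style_vec) to adapted coloring parameters
    (beta_hat, Gamma_hat) for the AdaIWT blocks."""
    h = np.maximum(style_vec @ W1 + b1, 0.0)  # ReLU hidden layer
    p = h @ W2 + b2                           # d + d*d outputs
    return p[:d], p[d:].reshape(d, d)         # beta_hat, Gamma_hat

def ada_iwt(V_hat, beta_hat, Gamma_hat):
    """AdaIWT coloring step (Eq. 8): apply the style-derived parameters
    to the already-whitened decoder features V_hat (shape (n, d))."""
    return V_hat @ Gamma_hat.T + beta_hat
```

The point of the MLP is that the raw style statistics cannot be applied directly: they live in the space of the encoder's first layers, so F re-maps them to the decoder's feature space.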

Network Training
GAN Training. For the sake of clarity, in the rest of the paper we use a simplified notation for G, in which G takes as input a single image instead of a batch. Specifically, let x̂_i = G(x_i, l_i^O, x_i^O) be the generated image, obtained from x_i (x_i ∈ D_{l_i}) with desired output domain l_i^O and style image x_i^O. G is trained using the combination of three different losses, with the goal of changing the style and the domain of x_i while preserving its content.
First, we use an adversarial loss based on the Projection Discriminator [32] (D_P), which is conditioned on labels (domain labels, in our case) and uses a hinge loss:

L_{D_P} = E[max(0, 1 - D_P(x_i, l_i))] + E[max(0, 1 + D_P(x̂_i, l_i^O))],  (9)
L_{Adv} = -E[D_P(x̂_i, l_i^O)].  (10)

The second loss is the Identity loss proposed in [53], which in our framework is implemented as follows:

L_{Id} = E[||G(x_i, l_i, x_i) - x_i||_1].  (11)

In Eq. 11, G computes an identity transformation, the input and the output domain and style being the same. After that, a pixel-to-pixel L_1 norm is computed.
Finally, we propose a third loss based on the rationale that the generation process should be equivariant with respect to a set of simple transformations which preserve the main content of the images (e.g., the foreground object shape). Specifically, we use the set of affine transformations {h(x; θ)} of image x, defined by the parameter θ (θ is a 2D transformation matrix). The affine transformation is implemented by a differentiable bilinear kernel as in [22]. The Equivariance loss is:

L_{Eq} = E[||h(x̂_i; θ_i) - G(h(x_i; θ_i), l_i^O, x_i^O)||_1].  (12)

In Eq. 12, for a given image x_i, we randomly choose a geometric parameter θ_i and we apply h(·; θ_i) to x̂_i, the image generated by G. Then, using the same θ_i, we apply h(·; θ_i) to x_i, obtaining x'_i = h(x_i; θ_i), which is input to G in order to generate a second image. The two generated images are finally compared using the L_1 norm. This is a form of self-supervision, in which equivariance to geometric transformations is used to extract semantics. Very recently a similar loss has been proposed in [19], where equivariance to affine transformations is used for image co-segmentation.
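In the simplified single-image notation for G, the equivariance check can be sketched as follows. G and the warp are placeholders here: the paper uses the full generator and a differentiable bilinear warp [22], while any content-preserving transform illustrates the idea:

```python
import numpy as np

def equivariance_loss(G, x, theta, transform):
    """Equivariance loss sketch (Eq. 12): L1 distance between
    transforming the generated image, h(G(x); theta), and generating
    from the transformed input, G(h(x; theta))."""
    return np.abs(transform(G(x), theta) - G(transform(x, theta))).mean()
```

For a perfectly equivariant G the two paths coincide and the loss is zero; any content-dependent discrepancy between them is penalized.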
The complete loss for G is:

L_G = L_{Adv} + λ(L_{Id} + L_{Eq}).  (13)

Note that Eqs. 9, 10 and 12 depend on the pair (x_i, l_i^O). This means that the supervisory information we effectively use grows as O((N + 1)^2), which is quadratic with respect to a plain strategy in which N different source-to-target generators are trained (Sec. 2).
Classifier Training. Once G is trained, we use it to artificially create a labeled training dataset (T^L) for the target domain. Specifically, for each S_j and each (x_i, y_i) ∈ S_j, we randomly pick x^t ∈ T, which is used as the reference style image, and we generate:

x̂_i = G(x_i, N + 1, x^t),  (14)

where N + 1 is fixed and indicates the target domain (D_t) label (see Sec. 3.1). (x̂_i, y_i) is added to T^L and the process is iterated. Note that, in different epochs, for the same (x_i, y_i) ∈ S_j, we randomly select a different reference style image x^t ∈ T.
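The construction of the synthetic labeled target set T^L can be sketched as a schematic loop. G here follows the simplified notation G(image, domain label, style image), and t_label corresponds to N + 1; data structures are illustrative:

```python
import random

def build_target_set(G, sources, target_images, t_label):
    """Sketch of Eq. 14 / Sec. 3.3: translate every labeled source image
    into the target domain, using a randomly drawn target image as the
    style reference, and keep the original class label."""
    T_L = []
    for S_j in sources:                             # S_j: list of (x, y) pairs
        for x, y in S_j:
            x_style = random.choice(target_images)  # random reference style
            x_fake = G(x, t_label, x_style)         # translated image
            T_L.append((x_fake, y))                 # label is preserved
    return T_L
```

Re-running this loop at every epoch draws a different style reference per source image, which increases the diversity of the synthetic target set.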
Finally, we train a classifier C on T^L using the cross-entropy loss.

Experimental Results
In this section we describe the experimental setup and then we evaluate our approach using common MSDA datasets. We also present an ablation study in which we separately analyse the impact of each TriGAN component.

Datasets
In our experiments we consider two common domain adaptation benchmarks, namely the Digits-Five benchmark [51] and the Office-Caltech dataset [12].
Digits-Five [51] is composed of five digit-recognition datasets: USPS [8], MNIST [24], MNIST-M [9], SVHN [35] and Synthetic numbers datasets [10] (SYNDIG-ITS). SVHN [35] contains Google Street View images of real-world house numbers. Synthetic numbers [10] includes 500K computer-generated digits with different sources of variations (i.e. position, orientation, color, blur). USPS [8] is a dataset of digits scanned from U.S. envelopes, MNIST [24] is a popular benchmark for digit recognition and MNIST-M [9] is its colored counterpart. We adopt the experimental protocol described in [51]: in each domain the train/test split is composed of a subset of 25000 images for training and 9000 images for testing. For USPS, the entire dataset is used.
Office-Caltech [12] is a domain-adaptation benchmark obtained by selecting the subset of 10 categories shared between Office31 and Caltech256 [14]. It contains 2533 images, about half of which belong to Caltech256. There are four different domains: Amazon (A), DSLR (D), Webcam (W) and Caltech256 (C).

Experimental Setup
Due to lack of space, we provide the architectural details of our generator G and discriminator D_P networks in the Supplementary Material. We train TriGAN for 100 epochs using the Adam optimizer [23] with the learning rate set to 1e-4 for G and 4e-4 for D_P, as in [16]. The loss weighting factor λ in Eq. 13 is set to 10, as in [53].
In the Digits-Five experiments we use a mini-batch of size 256 for TriGAN training. Due to the difference in image resolution and image channels, the images of all the domains are converted to 32 × 32 RGB. For a fair comparison, for the final target classifier C we use exactly the same network architecture used in [10,36].
In the Office-Caltech10 experiments we downsample the images to 164 × 164 to accommodate more samples in a mini-batch. We use a mini-batch of size 24 for training on 1 GPU. For the backbone target classifier C we use the ResNet101 [15] architecture used by [36]. The weights are initialized with a network pre-trained on the ILSVRC-2012 dataset [39]. In our experiments we remove the output layer and replace it with a randomly initialized fully-connected layer with 10 logits, one for each class of the Office-Caltech10 dataset. C is trained with Adam, with an initial learning rate of 1e-5 for the randomly initialized last layer and 1e-6 for all the other layers. In this setting we also include {S_j}_{j=1}^N in T^L when training the classifier C.

Results
In this section we quantitatively analyse TriGAN. In the Supplementary Material we show some qualitative results for Digits-Five and Office-Caltech10.

Comparison with State-of-the-Art Methods
Tab. 1 and Tab. 2 show the results on the Digits-Five and the Office-Caltech10 datasets, respectively. Table 1 shows that TriGAN achieves an average accuracy of 90.08%, which is higher than that of all the other methods. M3SDA is better in the mm, up, sv, sy → mt and in the mt, mm, sv, sy → up settings, where TriGAN is the second best. In all the other settings, TriGAN outperforms all the other approaches. As an example, in the mt, up, sv, sy → mm setting, TriGAN is better than the second-best method, M3SDA, by a significant margin of 10.38%. In the same table we also show the results obtained when we replace TriGAN with StarGAN [5], which is another "universal" image translator. Specifically, we use StarGAN to generate synthetic target images and then we train the target classifier using the same protocol described in Sec. 3.3. The corresponding results in Table 1 show that StarGAN, despite being known to work well for aligned face translation, drastically fails when used in this UDA scenario.
Finally, we also use Office-Caltech10, which is considered to be difficult for reconstruction-based GAN methods because of the high-resolution images. Although the dataset is quite saturated, TriGAN achieves a classification accuracy of 97.0%, outperforming all the other methods and beating the previous state-of-the-art approach (M 3 SDA) by a margin of 0.6% on average (see Tab. 2).


Ablation Study
In this section we analyse the different components of our method and study in isolation their impact on the final accuracy. Specifically, we use the Digits-Five dataset and the following models: i) Model A, which is our full model containing the following components: IWT, DWT, cDWT, AdaIWT and L_Eq. ii) Model B, which is similar to Model A except that we replace L_Eq with the cycle-consistency loss L_Cycle of CycleGAN [53]. iii) Model C, where we replace IWT, DWT, cDWT and AdaIWT of Model A with IN [49], BN [20], conditional Batch Normalization (cBN) [7] and Adaptive Instance Normalization (AdaIN) [18], respectively. This comparison highlights the difference between feature whitening and feature standardisation. iv) Model D, which ignores the style factor. Specifically, in Model D, the blocks related to the style factor, i.e., the IWT and the AdaIWT blocks, are replaced by DWT and cDWT blocks, respectively. v) Model E, in which the style path differs from Model A in the way the style is applied to the domain-specific representation. Specifically, we remove the MLP F(.) and we directly apply the extracted style statistics. vi) Model F, in which the domain factor is ignored.
Tab. 3 shows that Model A outperforms all the ablated models. Model B shows that L_Cycle is detrimental for the accuracy, because G may focus on meaningless information to reconstruct back the image. Conversely, the affine transformations used in the case of L_Eq force G to focus on the shape (i.e., the content) of the images. Model C is also outperformed by Model A, demonstrating the importance of feature whitening over feature standardisation and corroborating the findings of [38] in a pure-discriminative setting. Moreover, the no-style assumption in Model D hurts the classification accuracy by a margin of 1.76% when compared with Model A.
We believe this is due to the fact that, when only domain-specific latent factors are modeled and instance-specific style information is missing in the image translation process, the diversity of the translations decreases, consequently reducing the final accuracy (see the role of the randomly picked x^t ∈ T in Sec. 3.3). Model E shows the need for the proposed style path. Finally, Model F shows that having a separate factor for the domain yields better performance. Note that the ablation analysis in Tab. 3 is done by removing a single component from the full Model A, and the marginal difference with Model A shows that all the components are important. On the other hand, simultaneously removing all the components makes our model similar to StarGAN, where there is no style information and where the domain information is not "whitened" but provided as input to the network. As shown in Table 1, our full model drastically outperforms a StarGAN-based generative MSDA approach.

Multi domain image-to-image translation
Our proposed generator can be used for a pure generative (non-UDA), multi-domain image-to-image translation task.
We conduct experiments on the Alps Seasons dataset [1], which consists of images of the Alps across 4 different domains (corresponding to the 4 seasons). Fig. 2 shows some images generated with our generator. For this experiment we compare our generator with StarGAN [5] using the FID [16] metric. FID measures the realism of the generated images (the lower the better). The FID scores are computed considering all the real samples in the target domain and generating an equivalent number of synthetic images in the target domain. Tab. 4 shows that the TriGAN FID scores are significantly lower than the StarGAN scores. This further highlights that decoupling the style and the domain, and using WC-based layers to progressively "whiten" and "color" the image statistics, yields a more realistic cross-domain image translation than using domain labels as input, as in the case of StarGAN.
[Tab. 4, flattened during extraction, reports the FID scores for each target season (→Winter, →Summer, →Spring, →Autumn) for StarGAN [5] and TriGAN.]
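As a reference for how the scores in Tab. 4 are obtained, FID compares the Gaussian statistics (mean and covariance) of deep features of real versus generated images. A minimal NumPy sketch of the distance itself follows (the feature extraction through an Inception network is omitted; `_sqrtm_psd` is a helper we introduce here, using the identity Tr((Σ₁Σ₂)^½) = Tr((Σ₁^½ Σ₂ Σ₁^½)^½) so that only symmetric PSD matrix square roots are needed):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    tr_covmean = np.trace(_sqrtm_psd(s1_half @ sigma2 @ s1_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_covmean)
```

Identical feature distributions give a score of 0; the more the generated-feature statistics deviate from the real ones, the larger the score, which is why lower is better.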

Conclusions
In this work we proposed TriGAN, an MSDA framework based on data generation from multiple source domains using a single generator. The underlying principle of our approach is to obtain intermediate, domain- and style-invariant representations in order to simplify the generation process. Specifically, our generator progressively removes style- and domain-specific statistics from the source images and then re-projects the intermediate features onto the desired target domain and style. We obtained state-of-the-art results on two MSDA datasets, showing the potential of our approach.

A. Additional Multi-Source Results
Some sample translations produced by our G are shown in Figs. 3, 4 and 5. For example, in Fig. 3, when the SVHN digit "six" with side-digits is translated to MNIST-M, the cDWT blocks re-project it to the MNIST-M domain (i.e., a single digit without side-digits) and the AdaIWT block applies the instance-specific style of the digit "three" (i.e., a blue digit with a red background) to yield a blue "six" on a red background. Similar trends are observed in Fig. 4.

B. Implementation details
In this section we provide the architecture details of the TriGAN generator G and the discriminator D P .
Instance Whitening Transform (IWT) blocks. As shown in Fig. 6 (a), each IWT block is the sequence Convolution k×k − IWT − ReLU − AvgPool m×m, where k and m denote the kernel sizes. There are two IWT blocks in the encoder E. In the first IWT block we use k = 5 and m = 2, and in the second k = 3 and m = 2. Adaptive Instance Whitening (AdaIWT) blocks. The AdaIWT blocks are analogous to the IWT blocks, except that the IWT layer is replaced by an AdaIWT layer. An AdaIWT block is the sequence Upsampling m×m − Convolution k×k − AdaIWT − ReLU, with m = 2 and k = 3. AdaIWT also takes as input the coloring parameters (Γ, β) (see Sec. 3.2.3 and Fig. 6 (b)). Two AdaIWT blocks are used consecutively in the decoder. The last AdaIWT block is followed by a Convolution 5×5 layer.
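For intuition, instance whitening can be sketched as full ZCA whitening of the per-instance channel statistics, in contrast to Instance Normalization, which standardises each channel independently and leaves cross-channel correlations intact. A simplified NumPy sketch (operating on a C × HW feature matrix of a single instance; not the paper's exact implementation):

```python
import numpy as np

def instance_whiten(x, eps=1e-5):
    """Whiten a C x (H*W) feature matrix of one instance: subtract the
    per-instance mean and decorrelate the channels (ZCA whitening)."""
    c, n = x.shape
    mu = x.mean(axis=1, keepdims=True)
    xc = x - mu
    cov = xc @ xc.T / n + eps * np.eye(c)  # channel covariance (regularised)
    w, v = np.linalg.eigh(cov)
    wm = (v * (w ** -0.5)) @ v.T           # cov^(-1/2): ZCA whitening matrix
    return wm @ xc
```

After this operation the channel covariance of the instance is (approximately) the identity, i.e. both first- and second-order instance-specific statistics are removed, which is the sense in which the IWT strips style information.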

Style Path.
The Style Path is shown in Fig. 6 (c). Its output is (β 1, Γ 1) and (β 2, Γ 2), which are input to the second and the first AdaIWT blocks, respectively (see Fig. 6 (b)). The MLP is composed of five fully-connected layers; the first four have 256, 128, 128 and 256 neurons, and the last has a number of neurons equal to the cardinality of the coloring parameters (β, Γ). Domain Whitening Transform (DWT) blocks. The schematic representation of a DWT block is shown in Fig. 7 (a). For the DWT blocks we adopt a residual-like structure [15]: DWT − ReLU − Convolution 3×3 − DWT − ReLU − Convolution 3×3. We also add identity shortcuts in the DWT residual blocks to aid the training process. Conditional Domain Whitening Transform (cDWT) blocks. The proposed cDWT blocks are schematically shown in Fig. 7 (b). Similarly to a DWT block, a cDWT block contains the following layers: cDWT − ReLU − Convolution 3×3 − cDWT − ReLU − Convolution 3×3. Identity shortcuts are also used in the cDWT residual blocks.
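A conditional domain whitening layer can be sketched as whitening each sample with the batch statistics of its own domain and then coloring with domain-specific parameters. The following simplified NumPy version (on flattened (B, C) features; `gammas` and `betas` are stand-ins for the learned per-domain coloring parameters, and the whitening mirrors the ZCA computation above) illustrates the idea:

```python
import numpy as np

def cdwt(x, domains, gammas, betas, eps=1e-5):
    """Conditional Domain Whitening Transform (sketch).
    x: (B, C) features; domains: (B,) integer domain labels.
    Each domain's samples are whitened with that domain's batch statistics,
    then re-colored with domain-specific parameters gammas[d], betas[d]."""
    out = np.empty_like(x)
    for d in np.unique(domains):
        idx = domains == d
        xd = x[idx]
        mu = xd.mean(axis=0)
        xc = xd - mu
        cov = xc.T @ xc / xd.shape[0] + eps * np.eye(x.shape[1])
        w, v = np.linalg.eigh(cov)
        whiten = (v * (w ** -0.5)) @ v.T   # cov^(-1/2) for this domain
        out[idx] = (xc @ whiten) @ gammas[d].T + betas[d]
    return out
```

The whitening step removes domain-specific statistics (projecting all domains onto a shared spherical distribution), while the coloring step re-imposes the statistics of the desired output domain, matching the "whiten then re-project" description of the generator.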
All the above blocks are assembled to construct G, as shown in Fig. 8. Specifically, G contains two IWT blocks, one DWT block, one cDWT block and two AdaIWT blocks. It also contains the Style Path and two Convolution 5×5 layers (one before the first IWT block and one after the last AdaIWT block), which are omitted in Fig. 8 for the sake of clarity. {Γ 1, β 1, Γ 2, β 2} are computed using the Style Path.
For the discriminator D P we use a Projection Discriminator [32], with projection shortcuts instead of identity shortcuts. Fig. 9 schematically shows a discriminator block; D P is composed of 2 such blocks. We use spectral normalization [32] in D P.
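For reference, a projection discriminator scores a (feature, label) pair as the sum of an unconditional term and a label-projection term, D(x, y) = ψᵀφ(x) + yᵀVφ(x), rather than concatenating the label to the input. A minimal NumPy sketch (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def projection_d(phi, y, psi, V):
    """Projection discriminator output (sketch).
    phi: (B, F) image features, y: (B, K) one-hot domain labels,
    psi: (F,) unconditional weights, V: (K, F) label embedding matrix.
    Returns (B,) scores: psi^T phi(x) + y^T V phi(x) per sample."""
    uncond = phi @ psi                      # unconditional realism term
    cond = np.sum((y @ V) * phi, axis=1)    # inner product of label embedding
    return uncond + cond                    # with the image feature
```

The projection term conditions the discriminator on the target domain label without modifying the feature extractor, which is what allows a single D P to supervise translations into all domains.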

C. Experiments for single-source UDA
Since our proposed TriGAN framework is generic and can handle N-way domain translations, we also conduct experiments in the Single-Source UDA scenario, where N = 2 and the source domain is grayscale MNIST. We consider the following UDA settings with the digits datasets:
Pixel-level domain adaptation [2] (PixelDA), Unsupervised image-to-image translation networks [26] (UNIT), Symmetric bi-directional adaptive GAN [40] (SBADA-GAN), Generate to adapt [41] (GenToAdapt), Cycle-consistent adversarial domain adaptation [17] (CyCADA) and Image to image translation for domain adaptation [34] (I2I Adapt). As can be seen from Tab. 5, TriGAN performs best in two out of the three adaptation settings, and is third best in the remaining MNIST → MNIST-M setting. Notably, TriGAN does well on MNIST → SVHN, which is considered a particularly hard setting: it is 5.2% better than the second-best method, SBADA-GAN.