MapFlow: latent transition via normalizing flow for unsupervised domain adaptation

Unsupervised domain adaptation (UDA) aims at enhancing the generalizability of the classification model learned from the labeled source domain to an unlabeled target domain. An established approach to UDA is to constrain the classifier on an intermediate representation that is distributionally invariant across domains. However, recent theoretical and empirical research has revealed that relying only on invariance fails to guarantee a small target error, thus making equality in the distribution of representations unnecessary. In this paper, we propose to relax invariant representation learning by finding a general relationship between the source and target representations, which allows an interchange of the more discriminative domain information. To this end, we formalize the MapFlow framework, which explicitly constructs an invertible mapping between the target encoded distribution and variationally induced source representation. Empirical results on public benchmark datasets show the desirable performance of our proposed algorithm compared to state-of-the-art methods.


Introduction
Deep learning (DL) is currently the most widespread and successful methodology in machine learning (Voulodimos et al., 2018; Litjens et al., 2017). The epitome of this success is the outstanding performance of DL models on image classification tasks (Tan & Le, 2019; Chu et al., 2021). The accuracy of deep classifiers on the ImageNet Large Scale Visual Recognition Challenge (Krizhevsky et al., 2012) has to date reached 97% (Tan & Le, 2019), even surpassing human-level performance. However, such models perform poorly when tested on out-of-distribution data, preventing them from being safely deployed in real-world settings; annotating the test data instead requires a massive amount of human and computational resources. Unsupervised domain adaptation (UDA) seeks to ease the burden of the annotation process by transferring predictive models learned from the labeled training (source) domain to the unlabeled test (target) domain.
In the seminal work of domain-adversarial neural network (DANN) (Ganin et al., 2016), a discriminator is trained to distinguish between the representations of the source and target domains, while a generator learns to deceive the discriminator by generating domain-invariant representations. Despite the impressive results that DANN gained, it suffers from a critical restriction. The arbitrary transformation of the generator is prone to produce ambiguous target features that may even be specific to the source domain. Consequently, it may deteriorate target feature discriminability even as it enhances feature transferability. A plethora of variants has been proposed to ameliorate the discriminability of target features by (i) adjusting the classifier's decision boundaries (Saito et al., 2018; Lee et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020), (ii) regularizing the norm of invariant representations (Xu et al., 2019) or conditional prediction features, (iii) tackling the mode collapse issue (Pei et al., 2018), (iv) encouraging task-related distribution matching, and (v) utilizing domain-specific variations when separating the domain-invariant representations (Bousmalis et al., 2016; Gong et al., 2019).
Nevertheless, all the advances mentioned above disregard domain-specific characteristics and rely only on invariant features, which tends to be insufficient for well-performing classification. There are often variations associated with each domain that are unique and can contribute significantly to in-domain classification performance when leveraged.
Based on this observation, in this paper, we propose to relax invariance enforcement, a significant cause of inadaptability (Bouvier et al., 2019), by exploiting both domain-specific and invariant knowledge in capturing the interrelation between the source and target domains in the representation space. Accounting for the fact that a representation space may be more suitable for the target domain than it is for the source domain, we aim to find the relationship between source and target representations by learning a transformation from the source to the target domain in the feature space. Hence, we propose MapFlow, a general framework to relax domain invariance. The MapFlow framework (MFF) relies on normalizing flow to learn a bijective, non-linear transformation between the encoded target distribution and a flexible latent prior induced directly from the source latent space by variational inference.
MFF enables us to explicitly model target latent knowledge by efficiently regularizing the log-determinant of the Jacobian. Maximizing the determinant of the Jacobian helps to alleviate the distributional divergence by establishing a geometrical relationship between the source and target representations. Explicit latent distribution modeling has been explored for UDA (Liu et al., 2017; Grover et al., 2019; Zhu et al., 2019), where the source and target latent distributions are modeled as predefined parametric distributions. Different from those methods, a normalizing flow (NF), a specific type of invertible neural network (INN) with an easily computable Jacobian determinant, is employed to model the likelihood of the complex target latent distribution. In addition, unlike adversarial domain adaptation in representation space, which may fail to achieve multimodal alignment, MFF can preserve the multimodal structure of the target latent space, which is suitable for discriminative mapping or alignment.
The contributions of our work are as follows. First, we present a motivating scenario for relaxing excessive invariance in representation learning for UDA and propose a relaxed-invariant objective that overcomes the limitations of standard objectives. In particular, from a probabilistic perspective, we mathematically derive a lower bound on the joint probability distribution of the source and target domains as a unified framework and general objective for UDA called MapFlow. MapFlow enables us to (1) exploit a more complex distribution for the target domain, whose density we can model when the source latent distribution is known, and (2) leverage the relationship between the two domains rather than enforcing them to follow a simple and strict constraint (e.g., to be Gaussian distributed). Second, we empirically show that our proposed MapFlow loss improves the discriminability of the target domain.
The rest of the paper is organized as follows: related work is detailed in Sect. 2. Preliminary concepts are presented in Sect. 3. The motivational insight is investigated in Sect. 4. The proposed approach is discussed in Sect. 5, followed by an experimental evaluation of image classification performance in Sect. 6. Sect. 7 concludes the paper.

Unsupervised domain adaptation (UDA)
The success of supervised machine learning relies on the availability of a large amount of annotated training data from different domains, which is often cost-ineffective to collect and unrealistic in many cases. Unsupervised domain adaptation (UDA) aims to overcome this problem by transferring discriminative features extracted from the label-abundant source domain to the unlabelled target domain. A variety of methods has been proposed in the literature to attain adaptation. Apart from improvements in architecture designs (Li et al., 2016; Maria Carlucci et al., 2017; Wang et al., 2019; Xu et al., 2021) and optimization strategies (Acuna et al., 2022; Rangwani et al., 2022), these methods can generally be divided into three categories by resorting to the fundamental questions of what to adapt (the data or the model; Huang et al., 2021; Kundu et al., 2022), when to adapt (during training or testing; Gao et al., 2022), and how to adapt (by learning a domain-invariant representation (Sun & Saenko, 2016; Ganin et al., 2016) or by domain mapping (Liu & Tuzel, 2016; Liu et al., 2017; Murez et al., 2018; Gong et al., 2019; Hoffman et al., 2018)).
The difficulty in UDA is how to resolve the distributional shift between the source and target domains, which is mathematically characterized by a difference in joint probability distributions, p_t(x, y) ≠ p_s(x, y). The UDA problem is generally infeasible unless we make some assumptions about how the test distribution may alter. One of the most common assumptions is covariate shift, which assumes that the distributional shift is merely caused by inconsistency in the feature space, i.e., p_t(x) ≠ p_s(x). Importance sampling is employed by Shimodaira (2000) to bridge the distributional gap via a weighting mechanism w(x) = p_t(x)/p_s(x). However, the shift between two domains with high-dimensional data, such as images, stems from non-overlapping supports, thus requiring unbounded weights. Ben-David et al. (2007) theoretically showed that non-overlapping supports can be reconciled by learning a representation that exhibits invariance across domains, leading to numerous algorithms for UDA (Sun & Saenko, 2016; Sun et al., 2017; Ganin et al., 2016; Usman et al., 2020; Saito et al., 2018; Lee et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020; Chen et al., 2019; Xu et al., 2019; Long et al., 2018; Pei et al., 2018; Bousmalis et al., 2016; Gong et al., 2019). This approach increases the transferability of features, since high transferability corresponds to a nearly invariant representation, while low transferability implies more domain-specific features. However, not only may invariance learning substantially deteriorate adaptability, as has been shown both theoretically and empirically (Wu et al., 2019; Johansson et al., 2019; Zhao et al., 2019; Arjovsky et al., 2019; Bouvier et al., 2019), but it may also neglect domain-specificity that can be incredibly beneficial for target feature discriminability. Therefore, in this paper, we aim to relax restrictive invariance by preserving domain-specific properties enforced by reconstruction.
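As a toy illustration of the covariate-shift weighting w(x) = p_t(x)/p_s(x) discussed above, the following sketch (our own, not from the paper) reweights per-sample source losses for two hypothetical 1-D Gaussian domains; all names and parameter values here are illustrative assumptions:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_weight(x, mu_s=0.0, sigma_s=1.0, mu_t=1.0, sigma_t=1.0):
    """w(x) = p_t(x) / p_s(x) for two hypothetical 1-D Gaussian domains."""
    return gaussian_pdf(x, mu_t, sigma_t) / gaussian_pdf(x, mu_s, sigma_s)

def weighted_source_risk(samples, losses):
    """Reweight per-sample source losses so their average estimates the target risk."""
    weights = [importance_weight(x) for x in samples]
    return sum(w * l for w, l in zip(weights, losses)) / len(samples)
```

Note that as the two densities separate (supports barely overlapping), the weights blow up, which is exactly the failure mode the text cites as motivation for representation-based methods.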

Normalizing flow
A normalizing flow (NF) (Papamakarios et al., 2021) learns to transform an unknown, complex distribution into a simple distribution through a well-designed invertible network. NF models have been applied to several machine learning tasks, including image generation (Dinh et al., 2016; Kingma & Dhariwal, 2018), semi-supervised learning (Izmailov et al., 2020), inverse problems (Ardizzone et al., 2018), distribution matching (Usman et al., 2020), and domain adaptation (Gong et al., 2019; Grover et al., 2019; Sagawa & Hino, 2022). For example, log-likelihood ratio minimizing flows (LRMF) (Usman et al., 2020) leverage invertible flow networks and density estimation for distribution matching without adversarial training and define a new metric based on the log-likelihood ratio; the density model is not fixed and is trained to fit the mixture (1/2)p(z_s) + (1/2)p(x_t). As for domain adaptation, Gong et al. (2019) proposed domain flow (DLOW) to generate multiple intermediate domains along the data manifolds between the source and target domains using normalizing flow to reduce the domain shift. Sagawa and Hino (2022) used NF to generate non-adjacent intermediate domains between the source and target domains to solve UDA based on a gradual self-training idea (Kumar et al., 2020). AlignFlow (Grover et al., 2019) trains two normalizing flows separately to map the source and target domains to a common latent space with a Gaussian distribution and employs adversarial discriminators for further distribution alignment. In contrast, in this paper, we use an NF model to transform the discriminative source distribution into the target one in the representation space.

Notation and problem definition
Let X and Y be the input and output spaces, respectively, and let Z be the representation space generated from X by a feature transformation g: X → Z. Accordingly, we use X, Y, Z as random variables over the spaces X, Y, Z, and let lower-case x, y, and z denote the corresponding sample values. We also define an output labeling function h: Z → Y and a composite predictive transformation h ∘ g. Given n_s labeled samples from the source domain and unlabeled samples from the target domain, UDA aims to transfer the predictive knowledge learned from the source domain to the target domain.

Normalizing flow for transformation
The normalizing flow (Dinh et al., 2016; Kingma & Dhariwal, 2018) is a likelihood-based generative model defined as an invertible mapping F: X → Z from the observed space X to the latent space Z. The distribution of the observed variable can be modeled by applying a chain of invertible transformations, F = f_K ∘ ... ∘ f_1. Based on the change of variables formula, the probability distribution of the transformed random variable can be written as log p_X(x) = log p_Z(F(x)) + log |det(∂F(x)/∂x)|. Each f_k is characterized by a neural network with an architecture designed to ensure invertibility and efficient computation of determinants. We train the model by minimizing the negative log-likelihood of the training data D = {x_i}_{i=1}^N with respect to the parameters.
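The change-of-variables likelihood can be illustrated with a minimal sketch, assuming a standard-normal base density and a hypothetical elementwise affine transform (a real flow stacks learned invertible layers; the function names here are our own):

```python
import math

def std_normal_logpdf(z):
    """Log-density of a standard normal, applied elementwise and summed."""
    return sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z)

def affine_flow_logp(x, scale, shift):
    """Log-likelihood of x under an elementwise affine flow z = scale*x + shift
    with a standard-normal base density, via the change-of-variables formula:
    log p_X(x) = log p_Z(f(x)) + log|det(df/dx)|.
    For a diagonal map, log|det| is just the sum of log|scale_i|."""
    z = [s * v + b for v, s, b in zip(x, scale, shift)]
    log_det = sum(math.log(abs(s)) for s in scale)
    return std_normal_logpdf(z) + log_det
```

With identity scale and zero shift the flow is the identity, so the likelihood reduces to the base density itself; a non-unit scale shifts the log-likelihood by the log-determinant.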
The affine coupling layer is a powerful reversible transformation introduced in Dinh et al. (2016). The D-dimensional input z is partitioned into two vectors z_1 = z_{1:d} and z_2 = z_{d+1:D} with d < D. The output of one affine coupling layer is given by y_1 = z_1, y_2 = z_2 ⊙ exp(s(z_1)) + t(z_1), where s and t are functions from ℝ^d → ℝ^{D−d} and ⊙ is the Hadamard product. The inverse of the transformation is given by z_1 = y_1, z_2 = (y_2 − t(y_1)) ⊙ exp(−s(y_1)). Since the Jacobian of this transformation is triangular, its determinant is explicitly derived as det(∂y/∂z) = exp(Σ_j s(z_1)_j).
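The coupling equations above can be sketched in a few lines. The toy `_s` and `_t` networks below are hypothetical placeholders (a real layer uses learned MLPs), and this sketch assumes d = D − d so the elementwise placeholder functions line up:

```python
import math

def _s(z1):
    """Toy scale network s (a real model uses a learned MLP)."""
    return [0.5 * v for v in z1]

def _t(z1):
    """Toy translation network t (a real model uses a learned MLP)."""
    return [v + 1.0 for v in z1]

def coupling_forward(z, d):
    """y1 = z1; y2 = z2 * exp(s(z1)) + t(z1). Returns (y, log|det J|)."""
    z1, z2 = z[:d], z[d:]
    s, t = _s(z1), _t(z1)
    y2 = [v * math.exp(si) + ti for v, si, ti in zip(z2, s, t)]
    log_det = sum(s)  # triangular Jacobian: det = exp(sum_j s(z1)_j)
    return z1 + y2, log_det

def coupling_inverse(y, d):
    """z1 = y1; z2 = (y2 - t(y1)) * exp(-s(y1)). Exact inverse, no iteration."""
    y1, y2 = y[:d], y[d:]
    s, t = _s(y1), _t(y1)
    z2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(y2, s, t)]
    return y1 + z2
```

The key design property is visible here: both the inverse and the log-determinant are available in closed form regardless of how complex s and t are, which is why coupling layers are the building block of choice for flows.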

Motivational insight
In this section, we motivate our approach by highlighting a key issue with invariant representation learning. Consider the error of a predictor with respect to the true labeling function under a distribution D with joint probability p(x, y), as defined in our risk notation. For the target domain, adding and subtracting ∫ p_t(z) r_s(z) dz in Eq. (5) yields the decomposition in Eq. (7). The third term in Eq. (7) is zero when p_t(z) = p_s(z), and the second term becomes zero when the labeling function on the representation space remains fixed between the source and target domains. Indeed, since we have no labels for the target domain, we have no control over the second term. Wu et al. (2019) studied an upper bound on the third term of Eq. (7). This upper bound shows that if ε_s(h) = 0, then the condition p_t(z) = p_s(z) is no longer needed to make the third term in Eq. (7) equal to zero. Note that domain-invariant representation learning assumes that the ratio p_t(z)/p_s(z) equals 1; this equality enforcement (p_t(z)/p_s(z) = 1) may therefore deteriorate adaptability. As a result, we suggest relaxing this equality to a general invertible relationship between the latent spaces, which helps to transfer cross-domain knowledge without any loss of information.

The MapFlow framework
The learning of a joint distribution of source and target data has been studied for domain adaptation (Long et al., 2013;Liu et al., 2017;Courty et al., 2017;Damodaran et al., 2018). However, these methods assume shared latent space or cycle-consistency, which are both rather restrictive, as they impose strict constraints while modeling complex distributions in the latent space (Bouvier et al., 2019;Johansson et al., 2019). To tackle this, a general framework is presented to infer the joint distribution from the marginal ones without any additional assumption on the structure of the joint distribution. In this framework, we generalize the relationship between source and target representations by using an invertible neural network, through which the distribution of the target representation can be modeled without enforcing a strict constraint. We formulate the lower bound on the joint probability distribution over input spaces, which can be leveraged for the following multi-task learning objectives: 1) image translation between two domains, 2) sampling, and 3) classification.

Framework for joint distribution
We define a joint distribution over image samples and associated labels on the source and target domains as p(x_t, x_s, y_t, y_s). Assuming conditional independence between y_t and x_s given x_t, and between y_s and x_t given x_s, the joint distribution can be factorized under the chain rule as in Eq. (9), where θ denotes the model parameters. The third term in Eq. (9) can be interpreted as the probability of the model on target samples, the second term is the classification model on source samples, and the first term is the joint probability distribution over data samples, which can be defined by considering z_t and z_s as the latent variables modeling the source and target distributions. Since finding the maximum likelihood of this joint distribution is generally intractable, we leverage variational inference to jointly model the distribution. We assume a joint variational posterior q(z_t, z_s | x_t, x_s); then the joint log-evidence lower bound (ELBO) can be derived as in Eq. (11), where the first expectation term is a reconstruction error, the second refers to the joint prior distribution, and the third minimizes the entropy of the variational posterior. The reconstruction term can be factorized as p(x_t, x_s | z_t, z_s) = p_t(x_t | z_t, z_s) p_s(x_s | z_s) by assuming conditional independence between x_t and x_s given z_t. To simplify the third term on the right-hand side (RHS) of Eq. (11), we formulate a factorized variational posterior of the form q(z_t, z_s | x_t, x_s) = q_t(z_t | x_t) q_s(z_s | x_s), which is consistent with the conditional independence assumption between the latent space of one domain and the input space of the other. Also, we define z_t = f(z_s), which leads to a factorization of the joint prior as p(z_t, z_s) = p(z_t | z_s) p(z_s). Taking all these terms into account and using the chain rule along with Eq. (11), we can derive the final ELBO loss as in Eq. (12), where λ = (λ_sr, λ_tr, λ_kl, λ_f) are regularization parameters. Further details about the mathematical derivation of this loss can be found in Appendix A. An illustration of our general framework is provided in Fig. 1. It consists of one feature extractor (encoder) g_s(x_s; θ_s) to learn the posterior distribution for the source domain. We rely on variational inference (VI) to find an approximation g_s(x_s; θ_s) = q_s(z_s | x_s) for the true latent posterior distribution p(z_s | x_s), parameterized by a deep neural network with parameters θ_s. Therefore, the representation space of the source domain is forced to be Gaussian with distribution N(z_s | μ_s(x_s), σ_s²(x_s)), which can be used as a prior to model the target representation. For the target domain, on the other hand, an invertible neural network constructed from affine coupling layers, which facilitates computing the Jacobian J = ∂f^{−1}/∂z_t, is utilized to estimate the density of the target encoded samples g_t(x_t; θ_t) = q_t(z_t | x_t). Let z_s with dimension d be the encoded latent variable with unit Gaussian distribution p(z_s), and let z_t ∈ Z_t be an observation from an unknown target distribution z_t ∼ p(z_t).
Given f: z_s → z_t, we define a model p(z_t) with parameters on Z_t, and we can compute the negative log-likelihood (NLL) of z_t by the change of variables formula. For a single unlabeled target datapoint, the unsupervised objective is −log p(z_t) = −log p(f^{−1}(z_t)) − log |det(∂f^{−1}/∂z_t)|, where p is the prior distribution for the source domain. Minimizing this loss maps each unlabeled target sample into the corresponding embedding space.
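A minimal sketch of this NLL computation, assuming a standard-normal source prior and a hypothetical elementwise scaling as the invertible map f (MapFlow itself uses stacked coupling layers; names and parameterization here are our own):

```python
import math

def std_normal_logpdf(z):
    """Log-density of a standard normal prior, summed over dimensions."""
    return sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z)

def target_nll(z_t, log_scale):
    """-log p(z_t) for z_t = f(z_s) with f(z_s) = exp(log_scale) * z_s
    (elementwise) and a standard-normal source prior, via
    -log p(z_t) = -log p_s(f^{-1}(z_t)) - log|det d f^{-1} / d z_t|."""
    z_s = [v * math.exp(-a) for v, a in zip(z_t, log_scale)]  # f^{-1}(z_t)
    log_det_inv = -sum(log_scale)  # log|det| of the inverse (diagonal) map
    return -(std_normal_logpdf(z_s) + log_det_inv)
```

With a zero log-scale the map is the identity and the NLL coincides with the prior's NLL, which is a convenient sanity check when implementing the flow.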
Note that L in Eq. (12) has five terms: target reconstruction, source reconstruction, a prior term for the source domain (which can be learned with another invertible network), the entropy of the source dataset, and a mapping objective from target to source. In our method, the transfer properties are enforced by reconstructing and translating input images, and the source representation is constrained to be Gaussian.
The second term in Eq. (9) is a predictive function on the source dataset. Assuming that p(z_s | x_s) can be approximated by the variational posterior q_s(z_s | x_s), the predictive function h: Z_s → Y_s enforces separability between classes. As for the third term in Eq. (9), since we have no labels for the target domain, to learn a discriminative target representation we follow Shu et al. (2018) and Kumar et al. (2018) and apply the low-density and smoothness assumptions through conditional entropy (CE) minimization and virtual adversarial training (VAT). While conditional entropy minimization (Eq. (16)) forces the predictor to be confident on the unlabeled target data by pushing the decision boundaries away from the target data, the VAT loss (Eq. (17)) enforces prediction consistency within the neighborhood of training samples. Note that VAT can be applied on both or either of the source and target distributions.
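The conditional-entropy term can be sketched as follows. This is our own minimal illustration over plain softmax logits; the VAT term is omitted, since it requires gradient-based adversarial perturbations of the inputs:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_entropy(batch_logits):
    """Average prediction entropy over unlabeled target logits.
    Minimizing this pushes decision boundaries into low-density regions."""
    total = 0.0
    for logits in batch_logits:
        p = softmax(logits)
        total += -sum(pi * math.log(pi + 1e-12) for pi in p)
    return total / len(batch_logits)
```

Uniform predictions attain the maximal entropy log K for K classes, while confident one-hot-like predictions drive the loss toward zero, which is exactly the behavior the low-density assumption exploits.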
The overall objective of our proposed MFF to be minimized combines the ELBO, classification, and regularization losses above. This objective is overly complex to train directly. Hence, we further assume a shared encoder (θ_s = θ_t = θ) and a shared decoder for the source and target domains. Moreover, we let the translation loss, i.e., L_tr in Eq. (12), be learned adversarially by employing a discriminator with extra parameters θ_d. Therefore, the overall learning objective is redefined accordingly, where all encoder, decoder, flow, and discriminator parameters are learned jointly and λ = (λ_sr, λ_tr, λ_kl, λ_f, λ_s, λ_t) are regularization parameters. To simplify the training objective, we also pre-train the flow model with a supposedly Gaussian prior by using a target auto-encoder (Xiao et al., 2019).

Experiments
In this section, we first present the experimental setup and then provide implementation details of our model, followed by the results, where we compare our model with SOTA methods in UDA, and a qualitative analysis of the method.

Baselines
We primarily compare our proposed MapFlow with three baselines: ALDA, MDD+Implicit (Jiang et al., 2020), and VADA (Shu et al., 2018). We also show the results of several other recently proposed UDA models for comparison, including Maximum Classifier Discrepancy (MCD) (Saito et al., 2018), Joint Adaptation Network (JAN) (Long et al., 2017), Self-Ensembling (S-En) (French, 2017), and Conditional Domain Adversarial Networks (CDAN). For a fair comparison, the results are reported from the original papers where available. For all the experiments, we report the results in terms of accuracy for each domain shift, repeating each experiment 3 times and averaging the results.

Architecture
In order to make fair comparisons on the digits and CIFAR10/STL datasets, we adopt the architectural components, including the classifier network, the feature extractor, and the discriminator, used in DIRT-T (Shu et al., 2018). Similarly, we use a small architecture for the digits UDA tasks and a larger architecture for UDA experiments between CIFAR-10 and STL-10. For the Office-31 and VisDA-2017 datasets, we employ ResNet-50 (He et al., 2016), pre-trained on ImageNet (Russakovsky et al., 2015), as the feature extractor. The discriminator network is composed of two fully connected layers with dropout (Ganin et al., 2016). Note that our architecture differs slightly in that we add an invertible feature transform to the classifier network; however, the invertible network adds only a small parameter overhead to the shared feature extractor and classifier (less than 4%). For the invertible network applied to the latent variables, we use the Glow architecture (Kingma & Dhariwal, 2018) with 4 affine coupling blocks, where each block contains 3 fully connected layers, each with 256 or 512 hidden units depending on the dataset.

Training settings and hyper-parameters
For digits and CIFAR10/STL datasets, we implement adversarial training via alternating updates (Shu et al., 2018), and train the model using Adam optimizer (Kingma & Ba, 2014) with learning rate 10 −3 decaying by a factor of 2 after 200 epochs.
For the Office-31 and VisDA-2017 datasets, we follow the standard protocols, including the optimizer and learning rate strategy. We optimize the model using the Stochastic Gradient Descent (SGD) optimizer with momentum 0.9 and an adjusted learning rate η_q = η_0 (1 + αq)^{−β}, where η_0 = 0.01, α = 10, β = 0.75, and q is the training progress linearly decreasing from 1 to 0. Note that we set the learning rates of the classifier and discriminator to be 10 times that of the generator.
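The schedule above can be written as a one-liner; this is our reading of the garbled formula as the standard DANN-style annealed learning rate, and the parameter names are our own:

```python
def adjusted_lr(q, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_q = eta0 * (1 + alpha*q)^(-beta),
    where q is the training progress as described in the text."""
    return eta0 * (1.0 + alpha * q) ** (-beta)
```

At q = 0 the rate equals η_0 = 0.01 and decays smoothly as q grows, so the classifier and discriminator rates (10× the generator's) follow by a simple multiplication.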
As for the hyper-parameters (λ_sr, λ_tr, λ_kl, λ_s, λ_t), we tune the values for each dataset using cross-validation. We observed that extensive hyper-parameter tuning is not required to obtain top results. Accordingly, we limit the hyper-parameter search for each task to λ_sr = λ_tr ∈ {10^{−1}, 10^{−2}}, λ_kl ∈ {1, 10^{−1}}, λ_s ∈ {1}, and λ_t ∈ {0, 1, 10^{−1}, 10^{−2}}. Table 1 summarizes the average accuracy (%) on standard classification benchmarks for UDA (the digits, CIFAR-10, and STL datasets) compared with SOTA methods. For a fair comparison, we resize all images to 32 × 32 × 3 (except for adaptation from USPS to MNIST) and apply instance normalization (Shu et al., 2018) to the input images. Below, we present a brief analysis of the results in Table 1.

Results
USPS→MNIST: although USPS contains a smaller training set than MNIST, the domain discrepancy between these two datasets is relatively small, and we achieve high performance on USPS → MNIST.
MNIST↔SVHN: for the adaptation task SVHN → MNIST, we resize MNIST to the 32 × 32 dimension of SVHN, with three channels. This adaptation problem is readily solved when the proposed MapFlow is applied, and our method demonstrates performance similar to the SOTA DTA (Lee et al., 2019) on MNIST. The reverse task, MNIST → SVHN, can be regarded as the most challenging case among the digit datasets, as MNIST has considerably lower dimensionality than SVHN. Experiments show that MapFlow achieves state-of-the-art results on this adaptation task, improving on average over DIRT-T (Shu et al., 2018). This improvement shows the importance of relaxed invariant representation.
CIFAR-10↔STL-10: in both adaptation directions, results in Table 1 show that MapFlow is slightly better than the SOTA, which we believe is due to the relatively smaller training set for STL and the existing imbalance between two datasets.
The results in Table 2 again show the superiority of our approach compared to other recently proposed methods on the Office-31 datasets. We evaluate MapFlow across six UDA tasks:
Our method surpasses the baselines in 3 out of 6 pairs of adaptation tasks for Office-31. We further demonstrate the generalization ability of the proposed method by conducting additional experiments on VisDA-2017. In our experiments, we observed a gain of 0.6 points over the baseline , confirming the flexibility of MapFlow and its applicability across UDA tasks. The SOTA results with ResNet-50 are reported in Table 3.

Ablation studies
To examine the relative contribution of the invertible network in MapFlow, we conduct ablations on the adaptation tasks presented in Table 1, with and without the loss term of Eq. (13). The results are reported in Table 4, where the "no-nf" subscript denotes the removal of the NF component. When the loss including the log-determinant of the Jacobian is applied (MapFlow), our method demonstrates a significant improvement over MapFlow_no-nf and previous works. These results demonstrate the effectiveness of the flow model in relaxing invariance enforcement.

Qualitative analysis
To further analyze the relaxed invariant representation, we visualize the non-adapted and adapted feature representations generated from the last hidden layer of the model on the SVHN → MNIST UDA task using t-SNE (Van der Maaten & Hinton, 2008). As illustrated in Fig. 2, source-only (non-adapted) training shows strong clustering of the SVHN samples but performs poorly on MNIST (Fig. 2a). MapFlow delivers higher feature discriminability in the target domain by keeping each class well separated without enforcing the target clusters to be completely aligned with the source domain (Fig. 2c).

Table 3: Test accuracy (%).

Target error bound
The learning theory of UDA was initially proposed by Ben-David (Ben-David et al., 2010) and is summarized in Theorem 1.
Theorem 1 (Ben-David et al., 2010) Let H = {h ∘ g : h ∈ Φ, g ∈ G} be the hypothesis space, where G and Φ denote the sets of representations and predictive functions, respectively; let ε(h) be the risk for h ∈ H, and ε(h, h′) the disagreement risk for (h, h′) ∈ H². Then, for every h ∈ H,
ε_t(h) ≤ ε_s(h) + (1/2) d_{HΔH}(p_s, p_t) + Ψ,  (20)
where d_{HΔH} in the second term denotes the HΔH-distance between the source and target domains, and Ψ is the shared error of the ideal joint hypothesis. We analyze the second term (domain discrepancy) and the third term (ideal joint hypothesis error) of the target error bound in Eq. (20) on the SVHN → MNIST task.
Domain Discrepancy We approximate the domain discrepancy with the proxy A-distance d_A = 2(1 − 2ε), where ε denotes the error of a domain classifier trained to discriminate the source and target representations. As illustrated in Fig. 3a, MapFlow minimizes the domain discrepancy more than standard domain adversarial training, but not as much as the VADA (Shu et al., 2018) method does, implying a relaxed invariance.
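The proxy A-distance reduces to a single formula; here is a sketch with an assumed, pre-computed domain-classifier error (training the classifier itself is outside this snippet):

```python
def proxy_a_distance(domain_classifier_error):
    """d_A = 2 * (1 - 2*eps): proxy for the H-divergence, computed from the
    error eps of a classifier trained to separate source from target features.
    eps = 0.5 (chance level) means indistinguishable domains (d_A = 0);
    eps = 0 means perfectly separable domains (d_A = 2)."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)
```

A relaxed-invariance method is expected to land between these extremes: lower d_A than no adaptation, but not driven all the way to zero as strict invariance enforcement would aim for.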
Ideal Joint Hypothesis We evaluate the ideal joint hypothesis by training a two-layer MLP classifier on the adapted features from both the target and source domains, as suggested in prior work. As shown in Fig. 3b, MapFlow reduces the joint error, which indicates that our method improves feature discriminability.

Conclusion and future work
In this paper, a novel relaxed invariant representation learning approach is presented for unsupervised domain adaptation. In standard domain invariance learning, the transferability of feature representations is enhanced at the expense of their discriminability. Thus, we propose a general framework to relax invariance enforcement in the representation space. Our method encourages a combination of domain invariance and specificity to enhance target discriminability. The framework relies on normalizing flow to learn a transformation between the distributions of the target and source domains in the representation space. In effect, the normalizing flow maps a complex target latent distribution into a well-clustered latent source distribution through a sequence of invertible functions. We mathematically derived a variational lower bound for the probability distribution changing across domains and showed the consistency of the lower bound with the relaxed invariance assumption. Through extensive experiments, our approach demonstrates its superiority to other methods based on invariant representations on several public UDA datasets, validating our analysis. In future work, we intend to extend our model to work in the presence of label and conditional shifts for domain adaptation.

Appendix A: Derivation of the ELBO
We derive the ELBO of the joint distribution as follows:

log p(x_t, x_s) = log ∫∫ p(x_t, x_s | z_t, z_s) p(z_t, z_s) dz_s dz_t
= log ∫∫ p(x_t, x_s | z_t, z_s) p(z_t, z_s) [q(z_t, z_s | x_t, x_s) / q(z_t, z_s | x_t, x_s)] dz_s dz_t
= log E_{q(z_t, z_s | x_t, x_s)} [ p(x_t, x_s | z_t, z_s) p(z_t, z_s) / q(z_t, z_s | x_t, x_s) ]
≥ E_q [log p(x_t, x_s | z_t, z_s)] + E_q [log p(z_t, z_s)] − E_q [log q(z_t, z_s | x_t, x_s)],  (A1)

where the inequality follows from Jensen's inequality. The last result in Eq. (A1) is the ELBO of the joint distribution. We further assume conditional independence between the source and target distributions:

q(z_t, z_s | x_t, x_s) = q(z_t | x_t) q(z_s | x_s),  p(x_t, x_s | z_t, z_s) = p(x_t | z_t, z_s) p(x_s | z_s).  (A2)

Substituting Eq. (A2) into Eq. (A1) and expanding the expectations, we obtain

ELBO = E_{q(z_t|x_t) q(z_s|x_s)} [log p(x_t | z_t, z_s)] + E_{q(z_s|x_s)} [log p(x_s | z_s)] + E_{q(z_t|x_t) q(z_s|x_s)} [log p(z_t, z_s)] − E_{q(z_s|x_s)} [log q(z_s | x_s)] − E_{q(z_t|x_t)} [log q(z_t | x_t)].  (A3)

We label the five expectations on the right-hand side of Eq. (A3) as terms (1) to (5).  (A4)

Noting that, by Bayes' theorem and the chain rule,

p(x_t | z_t, z_s) = p(z_t, z_s | x_t) p(x_t) / p(z_t, z_s) = p(z_t | z_s, x_t) p(z_s | x_t) p(x_t) / p(z_t, z_s),  (A5)

term (1) in Eq. (A4) can be redefined as

(1) = E_{q(z_t|x_t) q(z_s|x_s)} [log p(x_t | z_t, z_s)] = E_{q(z_t|x_t) q(z_s|x_s)} [log p(x_t | z_s)] = E_{q(z_s|x_s)} [log p(x_t | z_s)] = E_{q(z_s|x_s)} log ∫ p(x_t | z_t) p(z_t | z_s) dz_t.

By assuming z_t = f(z_s), term (3) in Eq. (A4) can be redefined through the factorization p(z_t, z_s) = p(z_t | z_s) p(z_s). Putting all the terms together yields the final loss in Eq. (12).

Author contributions: YL contributed to this work through technical discussion, paper editing, and revision. HS contributed through technical discussion, paper editing, revision, and supervision.