1 Introduction

Deep learning (DL) is currently the most widespread and successful methodology in machine learning (Voulodimos et al., 2018; Litjens et al., 2017). The epitome of this success is the outstanding performance of DL models on image classification tasks (Tan & Le, 2019; Chu et al., 2021). The accuracy of deep classifiers on the ImageNet Large Scale Visual Recognition Challenge (Krizhevsky et al., 2012) has to date risen to \(97\%\) (Tan & Le, 2019), even surpassing human-level performance. However, these models perform poorly when tested on out-of-distribution data, which prevents them from being safely deployed in real-world settings. As a result, adapting them to new data tends to require a massive amount of human and computational resources to annotate the test data. Unsupervised domain adaptation (UDA) seeks to ease the burden of the annotation process by transferring predictive models learned from the labeled training (source) domain to the unlabeled test (target) domain.

To tackle UDA, a broad spectrum of methods has been proposed (Sun & Saenko, 2016; Ganin et al., 2016; Cui et al., 2020; Liu & Tuzel, 2016; Murez et al., 2018; Hoffman et al., 2018; Saito et al., 2017; Sun et al., 2019; Lee et al., 2019; Liu et al., 2019, 2021). The prevailing approach is to train a classifier on the source domain while finding a relationship between the source and target domains, primarily by matching their distributions in the representation space (i.e., learning a domain-invariant representation). Such invariant representations have been achieved by matching distribution properties, such as statistical moments (Long et al., 2015; Sun & Saenko, 2016; Sun et al., 2017) and supports (Tong et al., 2022), or by matching the full distributions (Ganin et al., 2016; Saito et al., 2017; Usman et al., 2020; Nguyen et al., 2021; Courty et al., 2017). A notable, top-performing example of the latter is adversarial domain adaptation (Ganin & Lempitsky, 2014; Ganin et al., 2016; Saito et al., 2017), which has yielded remarkable performance gains in UDA.

In the seminal work on the domain-adversarial neural network (DANN) (Ganin et al., 2016), a discriminator is trained to distinguish between the representations of the source and target domains, while a generator learns to deceive the discriminator by generating domain-invariant representations. Despite its impressive results, DANN suffers from a critical restriction: the arbitrary transformation of the generator is prone to produce ambiguous target features that may even be specific to the source domain. Consequently, it may deteriorate the discriminability of target features even though it enhances feature transferability. A plethora of variants has been proposed to improve the discriminability of target features by (i) adjusting the classifier’s decision boundaries (Saito et al., 2018; Lee et al., 2019; Zhang et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020), (ii) regularizing the norm of the invariant representation (Chen et al., 2019; Xu et al., 2019; Jin et al., 2020) or conditional prediction features (Cui et al., 2020; Jin et al., 2020), (iii) tackling the mode collapse issue (Long et al., 2018; Pei et al., 2018), (iv) encouraging task-related distribution matching (Wei et al., 2021; Jin et al., 2020), and (v) utilizing domain-specific variations when separating the domain-invariant representations (Bousmalis et al., 2016; Cui et al., 2020; Gong et al., 2019).

Nevertheless, all the advances mentioned above disregard domain-specific characteristics and rely only on the invariant features, which tends to be insufficient for accurate classification. Each domain often exhibits unique variations that can contribute significantly to in-domain classification performance when leveraged.

Based on this observation, in this paper, we propose to relax invariance enforcement, a significant cause of inadaptability (Bouvier et al., 2019), by exploiting both domain-specific and invariant knowledge in capturing the interrelation between source and target domains in the representation space. Accounting for the fact that a representation space may be more suitable for the target domain than it is for the source domain, we aim to find the relationship between source and target representations by learning a transformation from the source to the target domain in the feature space. Hence, we propose MapFlow, a general framework to relax domain invariance. The MapFlow framework (MFF) relies on a normalizing flow to learn a bijective, non-linear transformation between the encoded target distribution and a flexible latent prior induced directly from the source latent space by variational inference.

MFF enables us to explicitly model target latent knowledge by efficiently regularizing the log determinant of the Jacobian. Maximizing the determinant of the Jacobian helps to alleviate the distributional divergence by establishing a geometrical relationship between the source and target representations. Explicit latent distribution modeling has been explored for UDA (Liu et al., 2017; Grover et al., 2019; Zhu et al., 2019), where the source and target latent distributions are modeled as predefined parametric distributions. However, different from those methods, we employ a normalizing flow (NF), a specific type of invertible neural network (INN) with an easily computable determinant of the Jacobian, to model the likelihood of the complex target latent distribution. In addition, unlike adversarial domain adaptation in the representation space, which may fail to achieve multimodal alignment, MFF can preserve the multimodal structure of the target latent space, which is suitable for discriminative mapping or alignment.

The contributions of our work are as follows. First, we present a motivating scenario to relax the excessive invariance in representation learning for UDA and propose a relaxed-invariant objective in representation learning that overcomes the limitations of standard objectives. In particular, from a probabilistic perspective, we mathematically derive a lower bound on the joint probability distribution of the source and target domains as a unified framework and general objective for UDA called MapFlow. MapFlow enables us to (1) exploit a more complex distribution for the target domain for which we can model the density when the source latent distribution is known and (2) leverage the relationship between the two domains rather than enforcing them to follow a simple and strict constraint (e.g., to be Gaussian distributed). Second, we empirically show that our proposed MapFlow loss improves the discriminability of target-domain features.

The rest of the paper is organized as follows. Related work is detailed in Sect. 2. The preliminary concepts are presented in Sect. 3. The motivational insight is investigated in Sect. 4. The proposed approach is discussed in Sect. 5, followed by an experimental evaluation of image classification performance in Sect. 6. Sect. 7 concludes the paper.

2 Related works

2.1 Unsupervised domain adaptation (UDA)

The success of supervised machine learning relies on the availability of a large amount of annotated training data from different domains, which is often costly to collect and unrealistic in many cases. Unsupervised domain adaptation (UDA) aims to overcome this problem by transferring discriminative features extracted from the label-abundant source domain to the unlabelled target domain. A variety of methods has been proposed in the literature to attain adaptation. Apart from improvements in architecture designs (Li et al., 2016; Maria Carlucci et al., 2017; Wang et al., 2019, 2020; Xu et al., 2021) and optimization strategies (Wei et al., 2021; Acuna et al., 2022; Rangwani et al., 2022), these methods can be generally divided into three categories according to the fundamental questions of what to adapt, i.e., the data or the model (Liang et al., 2020; Huang et al., 2021; Kundu et al., 2022); when to adapt, i.e., during training or testing (Wang et al., 2020; Gao et al., 2022); and how to adapt, i.e., by learning a domain-invariant representation (Sun & Saenko, 2016; Ganin et al., 2016) or by domain mapping (Liu & Tuzel, 2016; Liu et al., 2017; Murez et al., 2018; Gong et al., 2019; Hoffman et al., 2018).

The difficulty in UDA is how to resolve the distributional shift between the source and target domains, which is mathematically characterized by the difference in joint probability distribution \(p_t({\textbf{x}}, {\textbf{y}}) \ne p_s({\textbf{x}}, {\textbf{y}})\). The UDA problem is generally infeasible unless we make some assumptions as to how the test distribution may change. One of the most common assumptions is covariate shift, which assumes that the distributional shift is merely caused by inconsistency in the feature space, i.e., \(p_t ({\textbf{x}}) \ne p_s ({\textbf{x}})\), while the conditional distribution of the labels given the inputs remains unchanged. Importance sampling is employed by Shimodaira (2000) to bridge the distributional gap via a weighting mechanism \(\textstyle {w({\textbf{x}}) = \frac{p_t({\textbf{x}})}{p_s({\textbf{x}})}}\). However, the shift between two domains with high-dimensional data, such as images, stems from non-overlapping supports, thus requiring unbounded weights. Ben-David et al. (2007) showed theoretically that non-overlapping supports can be reconciled by learning a representation that exhibits invariance across domains, leading to numerous algorithms for UDA (Sun & Saenko, 2016; Sun et al., 2017; Ganin et al., 2016; Saito et al., 2017; Usman et al., 2020; Saito et al., 2018; Lee et al., 2019; Zhang et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020; Chen et al., 2019; Xu et al., 2019; Cui et al., 2020; Jin et al., 2020; Long et al., 2018; Pei et al., 2018; Wei et al., 2021; Bousmalis et al., 2016; Gong et al., 2019). This approach increases the transferability of features: highly transferable features are close to invariant, whereas features with low transferability are more domain-specific.
However, not only may invariance learning substantially deteriorate adaptability, as shown in prior work (Wu et al., 2019; Johansson et al., 2019; Zhao et al., 2019; Arjovsky et al., 2019; Bouvier et al., 2019), but it may also neglect domain-specific information that can be highly beneficial for target feature discriminability. Therefore, in this paper, we aim to relax restrictive invariance by preserving domain-specific properties enforced by reconstruction.
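To make the remark about unbounded weights concrete, the following minimal sketch evaluates the importance weight \(w({\textbf{x}}) = p_t({\textbf{x}})/p_s({\textbf{x}})\) for two univariate Gaussian domains. The Gaussian form, the means, and the function names are assumptions chosen only for illustration; they are not part of the method discussed here.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def importance_weight(x, source=(0.0, 1.0), target=(3.0, 1.0)):
    """w(x) = p_t(x) / p_s(x) for two hypothetical Gaussian domains.

    `source` and `target` are (mean, std) pairs assumed for this sketch.
    """
    return gaussian_pdf(x, *target) / gaussian_pdf(x, *source)

if __name__ == "__main__":
    # With equal variances the weight is exp(3x - 4.5): it grows without
    # bound as x moves into regions the source density barely covers.
    for x in (0.0, 1.5, 3.0, 6.0):
        print(f"x={x:4.1f}  w(x)={importance_weight(x):.3e}")
```

Even in this one-dimensional case with overlapping supports, the weight grows exponentially in regions where the source density is small, which hints at why weighting becomes impractical when supports barely overlap.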

2.2 Normalizing flow

A normalizing flow (NF) (Papamakarios et al., 2021) learns to transform an unknown, complex distribution to a simple distribution by a well-designed invertible network. NF models have been applied to several machine learning tasks, including image generation (Dinh et al., 2016; Kingma & Dhariwal, 2018), semi-supervised learning (Izmailov et al., 2020), inverse problems (Ardizzone et al., 2018), distribution matching (Usman et al., 2020), and domain adaptation (Gong et al., 2019; Grover et al., 2019; Sagawa & Hino, 2022). For example, log-likelihood ratio minimizing flows (LRMF) (Usman et al., 2020) leverage invertible flow networks and density estimation for distribution matching without adversarial training and define a new metric based on the log-likelihood ratio. The density model is not fixed and is trained to fit the mixture \(\textstyle {\frac{1}{2} p({{{\textbf{z}}}_s}) + \frac{1}{2} p({{{\textbf{x}}}_t})}\). As for domain adaptation, Gong et al. (2019) proposed domain flow (DLOW) to generate multiple intermediate domains along the data manifolds between the source and target domains using normalizing flow to reduce the domain shift. Sagawa and Hino (2022) used NF to generate nonadjacent intermediate domains between the source and target domains to solve UDA based on a gradual self-training idea (Kumar et al., 2020). AlignFlow (Grover et al., 2019) trains two normalizing flows separately to map the source and the target domain to a common latent space with a Gaussian distribution and employs adversarial discriminators to perform further distribution alignment. In contrast, in this paper, we use an NF model to transform the discriminative source distribution to the target one in the representation space.

3 Preliminaries

3.1 Notation and problem definition

Let \({\mathcal {X}}\) and \({\mathcal {Y}}\) be the input and output spaces, respectively. \({\mathcal {Z}}\) is the representation space generated from \({\mathcal {X}}\) by a feature transformation \(g:{\mathcal {X}} \rightarrow {\mathcal {Z}}\). Accordingly, we use X, Y, Z as random variables from spaces \({\mathcal {X}}\), \({\mathcal {Y}}\), \({\mathcal {Z}}\), and let lower-case variables \({\textbf{x}}\), \({\textbf{y}}\), and \({\textbf{z}}\) denote the corresponding sample values, respectively. We also define an output labeling function \(\varphi : {\mathcal {Z}} \rightarrow {\mathcal {Y}}\) and a composite predictive transformation \(\varphi \circ g\). Given \(n_s\) labeled samples of the source domain \(\{({\textbf{x}}_i,{\textbf{y}}_i) \mid {\textbf{x}}_i \in {{\mathcal {X}}}_s, {\textbf{y}}_i\in {\mathcal {Y}}_s, i= 1, 2, \ldots , n_s\}\), with \(({\textbf{x}}, {\textbf{y}}) \sim p_{s}(X, Y)\), and \(n_t\) unlabelled samples of the target domain \(\{({\textbf{x}}_i) \mid {\textbf{x}}_i\in {\mathcal {X}}_t, i= 1, 2, \ldots , n_t\}\), with \({\textbf{x}}\sim p_{t}(X)\), UDA aims to transfer the predictive knowledge learned from the source domain to the target domain.

3.2 Normalizing flow for transformation

The normalizing flow (Dinh et al., 2016; Kingma & Dhariwal, 2018) is a likelihood-based generative model defined as an invertible mapping \(F:{\mathcal {X}} \rightarrow {\mathcal {Z}}\) from the observed space \({\mathcal {X}}\) to the latent space \({\mathcal {Z}}\). The distribution of the observed variable can be modeled by applying a chain of invertible transformations, which is composed of a sequence of invertible functions \(f = f_1 \circ f_2 \circ \ldots \circ f_L:\mathbb R^{d} \rightarrow {\mathbb {R}}^{d}\) with inverse \(F = f^{-1}\), on random latent variables with known distribution \({\textbf{z}}\sim p(Z)\). Based on the change of variables formula, the probability distribution of the transformed random variable can be written as follows:

$$\begin{aligned} \begin{aligned} p_{X}({{\textbf{x}}})&= \;p_{Z}(f^{-1}({\textbf{x}})) \;\Big |\det (J_{f^{-1}}({\textbf{x}})) \Big |=\; p_{Z}(f^{-1}({\textbf{x}})) \prod _{l=1}^{L} \;\Big |\det (J_{{f_{l}}^{-1}}({\textbf{h}}_l)) \Big |, \end{aligned} \end{aligned}$$
(1)

where \(J_{f^{-1}}({\textbf{x}}) = {\partial f^{-1}({\textbf{x}})}/{\partial {\textbf{x}}}\) is the Jacobian of \(f^{-1}\) with respect to \({\textbf{x}}\), \(\det (\cdot )\) denotes the determinant, and \({\textbf{h}}_l\) denotes the output of the intermediate mapping \(f_{l}\), with \({\textbf{h}}_1={\textbf{x}}\) and \({\textbf{h}}_L=f_L({\textbf{z}})\). The mapping \(F({\textbf{x}})\) is characterized by a neural network with an architecture designed to ensure invertibility and efficient computation of determinants. We train the model by minimizing the negative log-likelihood of the training data \(D=\{{\textbf{x}}_i\}_{i=1}^{N}\) with respect to the parameters \({\eta }\):

$$\begin{aligned} {\eta }^{*} = \mathop {\textrm{argmin}}\limits _{\eta } {\mathcal {L}}, \; {\mathcal {L}} = - \frac{1}{|D |} \sum _{{\textbf{x}}\in D} \log p({\textbf{x}}; \eta ) \end{aligned}$$
(2)
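As a minimal illustration of Eqs. (1) and (2), the sketch below uses a single one-dimensional affine map \(x = f(z) = a z + b\) with a standard normal latent variable. This toy flow and the function names are assumptions made for the example; an actual model would stack several learned invertible layers.

```python
import math

def standard_normal_logpdf(z):
    """log p_Z(z) for z ~ N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_logpdf(x, scale, shift):
    """log p_X(x) for the toy flow x = f(z) = scale * z + shift."""
    z = (x - shift) / scale                 # f^{-1}(x)
    log_abs_det = -math.log(abs(scale))     # log |d f^{-1}(x) / dx|
    return standard_normal_logpdf(z) + log_abs_det

def negative_log_likelihood(data, scale, shift):
    """Average negative log-likelihood of the data under the toy flow."""
    return -sum(flow_logpdf(x, scale, shift) for x in data) / len(data)

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(negative_log_likelihood(data, scale=1.0, shift=0.0))
    print(negative_log_likelihood(data, scale=math.sqrt(2.0 / 3.0), shift=2.0))
```

For this affine case the flow density coincides with a Gaussian of mean `shift` and standard deviation `|scale|`, so the negative log-likelihood is smallest when these match the sample mean and standard deviation of the data.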

The affine coupling layer is a powerful reversible transformation introduced in Dinh et al. (2016). Following Dinh et al. (2016), the D-dimensional input data \({\textbf{z}}\) is partitioned into two vectors \({\textbf{z}}_1 = {\textbf{z}}_{1:d}\) and \({\textbf{z}}_2 = {{\textbf{z}}}_{d+1:D}\) with \(d < D\). The output of one affine coupling layer is given by \({\textbf{y}}_1 = {\textbf{z}}_1\), \({\textbf{y}}_2 = {\textbf{z}}_2 \odot \exp (s({\textbf{z}}_1)) + t({\textbf{z}}_1)\), where s and t represent functions from \({\mathbb {R}}^d \rightarrow {\mathbb {R}}^{D-d}\) and \(\odot\) is the Hadamard product. The inverse of the transformation is given by \({\textbf{z}}_1 = {\textbf{y}}_1\), \({\textbf{z}}_2 = ({{{\textbf{y}}}_2} - t({{{\textbf{y}}}_1})) \odot \exp (-s({\textbf{y}}_1))\). The determinant of the Jacobian matrix of this transformation is explicitly derived as \(\textstyle {\det \frac{\partial {{\textbf{y}}}}{\partial {{\textbf{z}}}}=\prod _{j=1}^{D-d} (\exp [s({\textbf{z}}_1)_j])}\).
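The coupling transformation described above can be sketched in a few lines. Here `scale_fn` and `shift_fn` are hypothetical hand-written stand-ins for the functions s and t, which would be neural networks in a real model; the log-determinant is the sum of the outputs of s.

```python
import math

def scale_fn(z1, out_dim):
    """Toy stand-in for s: R^d -> R^(D-d) (an assumption for this sketch)."""
    total = sum(z1)
    return [math.tanh(total + j) for j in range(out_dim)]

def shift_fn(z1, out_dim):
    """Toy stand-in for t: R^d -> R^(D-d) (an assumption for this sketch)."""
    total = sum(z1)
    return [0.5 * total - j for j in range(out_dim)]

def coupling_forward(z, d):
    """y1 = z1, y2 = z2 * exp(s(z1)) + t(z1); also returns log|det dy/dz|."""
    z1, z2 = list(z[:d]), list(z[d:])
    s = scale_fn(z1, len(z2))
    t = shift_fn(z1, len(z2))
    y2 = [z2_j * math.exp(s_j) + t_j for z2_j, s_j, t_j in zip(z2, s, t)]
    return z1 + y2, sum(s)

def coupling_inverse(y, d):
    """z1 = y1, z2 = (y2 - t(y1)) * exp(-s(y1))."""
    y1, y2 = list(y[:d]), list(y[d:])
    s = scale_fn(y1, len(y2))
    t = shift_fn(y1, len(y2))
    z2 = [(y2_j - t_j) * math.exp(-s_j) for y2_j, s_j, t_j in zip(y2, s, t)]
    return y1 + z2

if __name__ == "__main__":
    z = [0.3, -1.2, 0.7, 2.0, -0.5]
    y, log_det = coupling_forward(z, d=2)
    print(y, log_det, coupling_inverse(y, d=2))
```

Because \({\textbf{y}}_1\) is a copy of \({\textbf{z}}_1\), the inverse can recompute s and t without inverting them, which is what keeps both directions cheap.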

4 Motivational insight

In this section, we motivate our approach by highlighting a key issue with invariant representation learning. Consider the error of a predictor \(h = \varphi \circ g\) with respect to the true labelling function \(\psi\) under a distribution \({\mathcal {D}}\) with joint probability distribution \(p({\textbf{x}},{\textbf{y}})\), defined as \(\textstyle {\varepsilon (h):= {\mathbb {E}}_{{\textbf{x}}\sim {\mathcal {D}}} \big [\big |\varphi (g({\textbf{x}})) - \psi ({\textbf{x}})\big |\big ]}\). Then for the target domain, we have:

$$\begin{aligned} \varepsilon _t(h) = \int p_t({\textbf{x}}) \big |\varphi (g({\textbf{x}}))-\psi ({\textbf{x}}) \big |d{\textbf{x}}, \end{aligned}$$
(3)

where \(r({\textbf{x}}) = \big |\varphi (g({\textbf{x}}))-\psi ({\textbf{x}}) \big |\) is the risk for input \({\textbf{x}}\). Following the change of variable rule (\(p({\textbf{x}})\,d{\textbf{x}} = p({\textbf{z}})\,d{\textbf{z}}\)), we then have

$$\begin{aligned} \varepsilon _{t} (h) = \int p_{t}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{t}({\textbf{z}})\mid d{\textbf{z}}= \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}. \end{aligned}$$
(4)

Similar to the proof presented by Ben-David et al. (2010), \(\varepsilon _{t}(h)\) can be rewritten as follows:

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \varepsilon _{t} (h) + \varepsilon _{s} (h) - \varepsilon _{s} (h) \\&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{t}({\textbf{z}})\mid d{\textbf{z}}\\ {}&- \int p_{s}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{s}({\textbf{z}})\mid d{\textbf{z}}\\&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}- \int p_{s}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}. \\ \end{aligned} \end{aligned}$$
(5)

We add and subtract \(\int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}\) on the right-hand side of Eq. (5),

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}- \int p_{s}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}\\&\qquad + \int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}- \int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}, \end{aligned} \end{aligned}$$
(6)

then we have:

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \underbrace{\varepsilon _{s} (h)}_{\textcircled {1}} + \underbrace{\int p_{t}({\textbf{z}}) (r_{t}({\textbf{z}}) - r_{s}({\textbf{z}}))d{\textbf{z}}}_{\textcircled {2}} + \underbrace{\int (p_{t}({\textbf{z}})-p_{s}({\textbf{z}})) r_{s}({\textbf{z}})d{\textbf{z}}}_{\textcircled {3}}. \end{aligned} \end{aligned}$$
(7)

The third term in Eq. (7) is zero when \(p_{t}({\textbf{z}})=p_{s}({\textbf{z}})\), and the second term becomes zero when the labeling function on the representation space remains fixed between the source and target domains. Indeed, we have \(r_{t}({\textbf{z}})-r_{s}({\textbf{z}}) = \mid \varphi ({\textbf{z}})-\psi _{t}({\textbf{z}}) \mid -\mid \varphi ({\textbf{z}})-\psi _{s}({\textbf{z}}) \mid \le \mid \psi _{t}({\textbf{z}})-\psi _{s}({\textbf{z}}) \mid\). However, as we do not have labels for the target domain, we have no control over the second term. Wu et al. (2019) studied an upper bound on the third term of Eq. (7), as follows:

$$\begin{aligned} \begin{aligned} \textcircled {3}&= \int {\bigg ({\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}- 1 \bigg ) p_{s}({\textbf{z}}) r_{s}({\textbf{z}})d{\textbf{z}}} \le \bigg (\sup _{{\textbf{z}}\in {\mathcal {Z}}}{\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}- 1 \bigg ) \varepsilon _{s} (h) \end{aligned} \end{aligned}$$
(8)

This upper bound shows that if \(\varepsilon _{s} (h) = 0\), then the condition \(p_{t}({\textbf{z}}) = p_{s}({\textbf{z}})\) is no longer needed to make the third term in Eq. (7) equal to zero. Note that in domain-invariant representation learning, we assume that the ratio \(\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}\) is equal to 1. Enforcing this equality (\(\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})} = 1}\)) is therefore stricter than necessary and may deteriorate adaptability. As a result, we suggest relaxing this equality to \(\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})} = \;|\det (J_{{f}^{-1}}({\textbf{z}}_s)) |}\), which helps to transfer the cross-domain knowledge between the latent spaces without any loss of information.
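The decomposition in Eq. (7) and the bound in Eq. (8) can be checked numerically on a small discrete latent space, where the integrals reduce to sums. The probabilities, predictions, and labelling functions below are made-up values used only for this sanity check.

```python
def risk(pred, label):
    """Pointwise risk r(z) = |phi(z) - psi(z)| over a finite latent space."""
    return [abs(a - b) for a, b in zip(pred, label)]

def expectation(weights, values):
    """Discrete analogue of the integral of weights(z) * values(z) dz."""
    return sum(w * v for w, v in zip(weights, values))

def decompose_target_error(p_s, p_t, phi, psi_s, psi_t):
    """Return (eps_t, eps_s, term2, term3, bound) for Eqs. (7) and (8)."""
    r_s, r_t = risk(phi, psi_s), risk(phi, psi_t)
    eps_s = expectation(p_s, r_s)                                # term 1
    eps_t = expectation(p_t, r_t)
    term2 = expectation(p_t, [a - b for a, b in zip(r_t, r_s)])  # labelling shift
    term3 = expectation([a - b for a, b in zip(p_t, p_s)], r_s)  # density shift
    # Upper bound on term 3; requires p_s > 0 on every latent point.
    bound = (max(a / b for a, b in zip(p_t, p_s)) - 1.0) * eps_s
    return eps_t, eps_s, term2, term3, bound

if __name__ == "__main__":
    # Hypothetical four-point latent space.
    p_s = [0.4, 0.3, 0.2, 0.1]
    p_t = [0.1, 0.2, 0.3, 0.4]
    phi = [0, 0, 1, 1]
    psi_s = [0, 1, 1, 1]
    psi_t = [0, 0, 1, 0]
    print(decompose_target_error(p_s, p_t, phi, psi_s, psi_t))
```

In this example the three terms sum to the target error, and the third term stays below the bound even though the two latent densities differ.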

5 The MapFlow framework

The learning of a joint distribution of source and target data has been studied for domain adaptation (Long et al., 2013; Liu et al., 2017; Courty et al., 2017; Damodaran et al., 2018). However, these methods assume shared latent space or cycle-consistency, which are both rather restrictive, as they impose strict constraints while modeling complex distributions in the latent space (Bouvier et al., 2019; Johansson et al., 2019). To tackle this, a general framework is presented to infer the joint distribution from the marginal ones without any additional assumption on the structure of the joint distribution. In this framework, we generalize the relationship between source and target representations by using an invertible neural network, through which the distribution of the target representation can be modeled without enforcing a strict constraint. We formulate the lower bound on the joint probability distribution over input spaces, which can be leveraged for the following multi-task learning objectives: 1) image translation between two domains, 2) sampling, and 3) classification.

5.1 Framework for joint distribution

We define a joint distribution over image samples and associated labels on the source and target domains as \(p_{\tau }({\textbf{x}}_t, {\textbf{x}}_s, {\textbf{y}}_t, {\textbf{y}}_s)\). Assuming the conditional independence between \({\textbf{y}}_t\) and \({\textbf{x}}_s\) given \({\textbf{x}}_t\), and also the conditional independence between \({\textbf{y}}_s\) and \({\textbf{x}}_t\) given \({\textbf{x}}_s\), the joint distribution can be factorized under the chain rule as follows:

$$\begin{aligned} p_{\tau }({\textbf{x}}_t, {\textbf{x}}_s,{\textbf{y}}_t, {\textbf{y}}_s) = p_{\gamma }({\textbf{x}}_t, {\textbf{x}}_s) p_{\beta }({\textbf{y}}_s \vert {\textbf{x}}_s) p_{\alpha }({\textbf{y}}_t \vert {\textbf{y}}_s, {\textbf{x}}_t), \end{aligned}$$
(9)

where \(\tau = \{\gamma , \beta , \alpha \}\) represents the model parameters. The third term in Eq. (9) can be interpreted as the predictive probability of the model on target samples, the second term is the classification model on source samples, and the first term is the joint probability distribution over data samples, which can be defined as follows, by considering \({\textbf{z}}_t\) and \({\textbf{z}}_s\) as the latent variables that model the target and source distributions, respectively:

$$\begin{aligned} p_{{\gamma }}({\textbf{x}}_t,{\textbf{x}}_s) = \int p_{{\theta }}({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)p_{{\eta }}({\textbf{z}}_t,{\textbf{z}}_s)d{\textbf{z}}_t d{\textbf{z}}_s, \end{aligned}$$
(10)

where maximizing the likelihood of this joint distribution is generally intractable. Thus, we leverage variational inference to model the joint distribution. We assume a joint variational posterior \(q_{\phi }({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)\); the joint evidence lower bound (ELBO) can then be derived as follows:

$$\begin{aligned} \begin{aligned} \log p_{{\gamma }}({\textbf{x}}_t,{\textbf{x}}_s)&\ge {\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)}\big [\log p_{{\theta }}({\textbf{x}}_t, {\textbf{x}}_s\vert {\textbf{z}}_t, {\textbf{z}}_s)\big ]\\&+{\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)} \big [\log p_{{\eta }}({\textbf{z}}_t, {\textbf{z}}_s)\big ]\\&- {\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)} \big [\log q_{{\phi }}({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)\big ], \end{aligned} \end{aligned}$$
(11)

where the first expectation term is a reconstruction term, the second one refers to the joint prior distribution, and the third expectation term corresponds to the entropy of the variational posterior. The reconstruction term can be factorized as \(p_{{\theta }}({\textbf{x}}_t, {\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s) = p_{{\theta _t}}({\textbf{x}}_t\vert {\textbf{z}}_t, {\textbf{z}}_s) p_{{\theta _s}}({\textbf{x}}_s\vert {\textbf{z}}_s)\) by assuming the conditional independence between \({\textbf{x}}_t\) and \({\textbf{x}}_s\) given \({\textbf{z}}_t\). To simplify the third term on the right-hand side (RHS) of Eq. (11), we formulate a factorized variational posterior of the form \(q_{{\phi }}({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s) = q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t) q_{{\phi }_s}({\textbf{z}}_s\vert {\textbf{x}}_s)\), which is consistent with the conditional independence assumption between the latent space of one domain and the input space of the other. Also, we define \({\textbf{z}}_t = f({\textbf{z}}_s)\), which leads to the factorization of the joint prior as \(p_{\eta }({\textbf{z}}_t, {\textbf{z}}_s) = p_{\eta }({\textbf{z}}_t\vert {\textbf{z}}_s) p_{\eta }({\textbf{z}}_s)\). Taking all these terms into account and using the chain rule along with Eq. (11), we can derive the final ELBO loss as follows:

$$\begin{aligned} \begin{aligned} {{\mathcal L}_{\gamma }}(\theta _s, \theta _t, \phi _{s}, \phi _{t}, \eta )&= \lambda _{tr}\underbrace{{\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)}\bigg [{\mathbb {E}}_{p_{\eta }({\textbf{z}}_t\vert {\textbf{z}}_s)}\big [\log p_{\theta _t}({\textbf{x}}_t\vert {\textbf{z}}_t)\big ]\bigg ]}_{({\mathcal {L}}_{tr})}\\&+ \lambda _{sr}\underbrace{{\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log p_{\theta _s}({\textbf{x}}_s\vert {\textbf{z}}_s)\big ]}_{({\mathcal {L}}_{sr})} \\&+ \lambda _{kl}\underbrace{{\mathbb {E}}_{q_{\phi _{s}}({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log p_{\eta }({\textbf{z}}_s)-\log q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)\big ]}_{({\mathcal {L}}_{kl})} \\&- \lambda _{f}\underbrace{{\mathbb {E}}_{q_{\phi _t}({\textbf{z}}_t\vert {\textbf{x}}_t)}\bigg [\log p(f^{-1}({\textbf{z}}_t))-\log \Big |\det \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}\Big |\bigg ]}_{({\mathcal {L}}_f)}, \end{aligned} \end{aligned}$$
(12)

where \(\lambda = (\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{f})\) are regularization parameters. Further details about the mathematical derivation of this loss can be found in Appendix A. An illustration of our general framework is provided in Fig. 1. It consists of one feature extractor (encoder) \({g}_{s}({\textbf{x}}_s; \phi _{s})\) to learn the posterior distribution for the source domain. We rely on variational inference (VI) to find an approximation \({g}_{s}({\textbf{x}}_s; \phi _{s}) = q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)\) for the true latent posterior distribution \(p_{\theta } ({\textbf{z}}_s\vert {\textbf{x}}_s)\), which is parameterized by a deep neural network with parameters \(\phi _{s}\). Therefore, the representation space of the source domain is forced to be Gaussian with distribution \(\textstyle {\displaystyle {\mathcal {N}} ({\textbf{z}}_s \vert \mu _{\phi _s}({\textbf{x}}_s), \sigma ^2_{\phi _s}({\textbf{x}}_s))}\), which can be used as a prior to model the target representation.
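To illustrate the \({\mathcal {L}}_{kl}\) term of Eq. (12) for a diagonal Gaussian source posterior, the sketch below compares a Monte Carlo estimate obtained with the reparameterization \({\textbf{z}}_s = \mu + \sigma \odot \epsilon\) against the closed form. It assumes a standard normal prior \(p_{\eta }({\textbf{z}}_s)\) purely for simplicity, and the function names and the values of \(\mu\) and \(\sigma\) are hypothetical.

```python
import math
import random

def normal_logpdf(z, mean, std):
    """Log-density of a univariate normal distribution."""
    u = (z - mean) / std
    return -0.5 * u * u - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def kl_term_closed_form(mu, sigma):
    """E_q[log p(z) - log q(z|x)] for q = N(mu, diag(sigma^2)), p = N(0, I)."""
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                     for m, s in zip(mu, sigma))

def kl_term_monte_carlo(mu, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of the same expectation via reparameterization."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):
            z = m + s * rng.gauss(0.0, 1.0)   # z = mu + sigma * eps
            total += normal_logpdf(z, 0.0, 1.0) - normal_logpdf(z, m, s)
    return total / num_samples

if __name__ == "__main__":
    mu, sigma = [0.5, -1.0], [0.8, 1.2]       # hypothetical encoder outputs
    print(kl_term_closed_form(mu, sigma))
    print(kl_term_monte_carlo(mu, sigma))
```

The term equals the negative Kullback-Leibler divergence between the posterior and the prior, so it is never positive and vanishes only when the two coincide.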

Fig. 1 Illustration of the proposed unified framework for UDA

For the target domain, on the other hand, an invertible neural network constructed from affine coupling layers, which facilitates computing the Jacobian \(\textstyle {J = \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}}\), is used to estimate the density of the encoded target samples \(\textstyle {{g}_{t}({\textbf{x}}_t; {\phi }_{t}) = q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t)}\).

Let \({\textbf{z}}_s\) with dimension d be the encoded latent variable with unit Gaussian distribution \(p({\textbf{z}}_s)\), and let \({\textbf{z}}_t\in {\mathcal {Z}}_t\) be an observation from an unknown target distribution \({\textbf{z}}_t \sim p({\textbf{z}}_t)\). Given \(f_{{\eta }}: {\textbf{z}}_s \rightarrow {\textbf{z}}_t\), we define a model \(p_{{\eta }}({\textbf{z}}_t)\) with parameters \(\eta\) on \({\mathcal {Z}}_t\), and we can compute the log-likelihood of \({\textbf{z}}_t\), and hence its negative log-likelihood (NLL), by the change of variable formula. For a single unlabeled target datapoint, the unsupervised objective can be derived as follows:

$$\begin{aligned} \begin{aligned} \log p_{\eta }({\textbf{z}}_t) =&\displaystyle {{\mathcal {L}}}_f(f_{\eta }({\textbf{z}}_t)) = \log p_{\eta }(f_{\eta }^{-1}({\textbf{z}}_t)) + \log \Big |\det \left(\displaystyle \frac{\partial f_{\eta }^{-1}({\textbf{z}}_t)}{\partial {\textbf{z}}_t}\right)\Big |, \end{aligned} \end{aligned}$$
(13)

where \(p_{{\eta }}\) is the prior distribution for the source domain. Minimizing the corresponding NLL, i.e., the negative of Eq. (13), helps to map each unlabeled target sample into the corresponding embedding space.

Note that \({{\mathcal L}_{{\gamma }}}\) in Eq. (12) has five terms, including target reconstruction, source reconstruction, a prior term for source domain, which can be learned with another invertible network, the entropy of source dataset, and a mapping objective from target to source. In our method, the transfer properties are enforced by reconstructing and translating input images. The range of the source representation part has been restrained to be Gaussian.

The second term in Eq. (9) is a predictive function on source datasets. Assuming that \(p_{\theta }({\textbf{z}}_s\vert {\textbf{x}}_s)\) can be approximated by the variational posterior \(q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)\), we have:

$$\begin{aligned} \begin{aligned} p_{\beta }({\textbf{y}}_s \vert {\textbf{x}}_s)&= \int p_{\omega }({\textbf{y}}_s \vert {\textbf{z}}_s)p_{\theta }({\textbf{z}}_s\vert {\textbf{x}}_s) d{\textbf{z}}_s \approx {\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s \vert {\textbf{x}}_s)}[p_{\omega }({\textbf{y}}_s \vert {\textbf{z}}_s)]. \end{aligned} \end{aligned}$$
(14)

The predictive function \({\varphi }_{\omega }:{\mathcal {Z}}_s \rightarrow {\mathcal {Y}}_s\) enforces separability between classes,

$$\begin{aligned} {\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s) =- {\mathbb {E}}_{{\textbf{z}}_s \sim q_{{\phi _{s}}}({\textbf{z}}_s \vert {\textbf{x}}_s)}[{\textbf{y}}_{s}^T \ln {\varphi }_\omega ({\textbf{z}}_s)]. \end{aligned}$$
(15)

As for the third term in Eq. (9), since we have no labels for the target domain, we follow Shu et al. (2018) and Kumar et al. (2018) to learn a discriminative target representation, and apply the low-density and smoothness assumptions through conditional entropy (CE) minimization and virtual adversarial training (VAT):

$$\begin{aligned} {\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) = -{\mathbb {E}}_{{{\textbf{z}}_t} \sim q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t)}[{\varphi _{{\omega }}({\textbf{z}}_t)}^T \ln \varphi _{{\omega }}({\textbf{z}}_t)] \end{aligned}$$
(16)
$$\begin{aligned} {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })= {\mathbb {E}}_{{{\textbf{z}}_t} \sim q_{{\phi }_t}({\textbf{z}}_t\vert {{\textbf{x}}}_t)}\big [\max _{\left\Vert r\right\Vert \le \epsilon } D_{KL}(\varphi _{{\omega }}({\textbf{z}}_t) \vert \vert \varphi _{{\omega }}({\textbf{z}}_t+r))\big ]. \end{aligned}$$
(17)

While conditional entropy minimization (Eq. (16)) forces the predictor to be confident on the unlabeled target data by pushing the decision boundaries away from the target data, the VAT loss (Eq. (17)) enforces prediction consistency within the neighborhood of training samples. Note that VAT can be applied to either or both of the source and target distributions.
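The two target-domain terms can be sketched for a single latent code as follows. This is only an illustration under stated assumptions: the classifier is a hypothetical linear-softmax model, and the inner maximization of Eq. (17) is approximated by a crude random search over perturbations of norm \(\epsilon\), which yields a lower bound on the true maximum and is not the gradient-based approximation normally used for VAT.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(z, weights, bias):
    """Hypothetical linear-softmax classifier acting on a latent code z."""
    return softmax([sum(w * zi for w, zi in zip(row, z)) + b
                    for row, b in zip(weights, bias)])


def conditional_entropy(probs):
    """-sum_k p_k ln p_k, the integrand of Eq. (16) for one sample."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def vat_loss_random_search(z, weights, bias, epsilon=0.5,
                           n_trials=200, seed=0):
    """Lower-bound the inner max of Eq. (17) by sampling directions."""
    rng = random.Random(seed)
    p_clean = predict(z, weights, bias)
    best = 0.0  # r = 0 always gives KL = 0
    for _ in range(n_trials):
        direction = [rng.gauss(0.0, 1.0) for _ in z]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        r = [epsilon * d / norm for d in direction]
        p_adv = predict([zi + ri for zi, ri in zip(z, r)], weights, bias)
        best = max(best, kl_divergence(p_clean, p_adv))
    return best


# Toy example with a 2-class classifier on a 2-D latent code.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
z_t = [0.2, -0.1]
ce = conditional_entropy(predict(z_t, W, b))
vat = vat_loss_random_search(z_t, W, b)
```

The entropy is largest for a uniform prediction and zero for a one-hot prediction, which is why minimizing it pushes target samples away from decision boundaries.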

The overall objective of our proposed MapFlow to be minimized is given by:

$$\begin{aligned} \begin{aligned} \min _{{\theta }_s, {\theta }_t, {\phi }_{s}, {\phi }_{t}, {\eta }, {\omega }} \quad& {{\mathcal L}_{{\gamma }}}({\theta }_{s}, {\theta }_{t}, {\phi }_{s}, {\phi }_{t}, {\eta }) + \lambda _{s}{\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s)\\ {}&+ \lambda _{t}({\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) + {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })), \end{aligned} \end{aligned}$$
(18)

This objective is too complex to optimize directly. Hence, we further assume a shared encoder (\({\theta }_{s} = {\theta }_{t} = {\theta }\)) and a shared decoder (\({\phi }_{s} = {\phi }_{t} = {\phi }\)) for the source and target domains. Moreover, we let the translation loss, i.e., \({\mathcal {L}}_{tr}\) in Eq. (12), be learned adversarially by employing a discriminator d with additional parameters \(\theta _{d}\). The overall learning objective is therefore redefined as follows:

$$\begin{aligned} \begin{aligned} \min _{{\theta }, {\phi }, {\eta }, {\omega }} \max _{\theta _{d}} \quad& {{\mathcal L}_{{\gamma }}}({\theta }, {\phi }, {\eta }) + \lambda _{s}{\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s) + \lambda _{t}({\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) + {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })), \end{aligned} \end{aligned}$$
(19)

where \(\mu =({\theta }, {\phi }, {\eta }, {\omega })\) are all the parameters to be learned, and \(\lambda = (\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{f}, \lambda _{s}, \lambda _{t})\) are regularization parameters. To simplify the training objective, we also tried pre-training the flow model with a presumed Gaussian prior using a target auto-encoder (Xiao et al., 2019).

6 Experiments

In this section, we first present the experimental setup and the implementation details of our model. We then report the results, comparing our model with the SOTA methods in UDA, and conclude with a qualitative analysis of the method.

6.1 Setup

6.1.1 Data sets

To demonstrate the performance of our proposed method, we evaluate our model on three commonly used digit datasets for UDA: MNIST (LeCun, 1998), SVHN (Netzer et al., 2011), and USPS (Le Cun et al., 1990). For general object classification tasks, we rely on CIFAR-10 (Krizhevsky & Hinton, 2009), STL-10 (Coates et al., 2011), and Office-31 (Saenko et al., 2010). Additionally, we evaluate our model on adaptation tasks for a large-scale dataset. In particular, we test on VisDA-2017 (Peng et al., 2017) for the image classification task.

6.1.2 Baselines

We primarily compare our proposed MapFlow with three baselines: ALDA (Chen et al., 2020), MDD+Implicit (Jiang et al., 2020), and VADA (Shu et al., 2018). We also show the results of several other recently proposed UDA models for comparison, including Maximum Classifier Discrepancy (MCD) (Saito et al., 2018), Joint Adaptation Network (JAN) (Long et al., 2017), Self-Ensembling (S-En) (French, 2017), and Conditional Domain Adversarial Networks (CDAN) (Long et al., 2018). For a fair comparison, baseline results are taken from the original papers when available. For all experiments, we report the accuracy for each domain shift, averaged over 3 runs.

6.2 Implementation

6.2.1 Architecture

To make fair comparisons on the digits and CIFAR-10/STL-10 datasets, we adopt the architectural components used in DIRT-T (Shu et al., 2018), including the classifier network, the feature extractor, and the discriminator. Similarly, we use a small architecture for the digits UDA tasks and a larger architecture for UDA experiments between CIFAR-10 and STL-10. For the Office-31 and VisDA-2017 datasets, we employ ResNet-50 (He et al., 2016), pre-trained on ImageNet (Russakovsky et al., 2015), as the feature extractor. The discriminator network is composed of two fully connected layers with dropout (Ganin et al., 2016). Note that our architecture is slightly different, as we include an invertible feature transform in the classifier network; however, the invertible network adds only a small parameter overhead to the shared feature extractor and classifier (less than 4%). For the invertible network applied to the latent variables, we use the Glow architecture (Kingma & Dhariwal, 2018) with 4 affine coupling blocks, where each block contains 3 fully connected layers, each with 256 or 512 hidden units depending on the dataset.
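To make the role of the affine coupling blocks concrete, the sketch below stacks four of them on a small vector and checks that the mapping is exactly invertible with a tractable log-determinant. It is a simplified illustration, not the Glow implementation: the conditioner is a hypothetical toy function standing in for the fully connected layers, and a fixed reversal of the vector replaces the learned permutations used in Glow.

```python
import math


def make_conditioner(scale_w, shift_w):
    """Hypothetical toy conditioner; a real block would use FC layers."""
    def conditioner(x_a):
        s = sum(x_a)
        log_scale = [math.tanh(w * s) for w in scale_w]  # bounded in (-1, 1)
        shift = [w * s for w in shift_w]
        return log_scale, shift
    return conditioner


def coupling_forward(x, conditioner):
    """Keep the first half, affinely transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = conditioner(x_a)
    y_b = [xb * math.exp(ls) + t for xb, ls, t in zip(x_b, log_scale, shift)]
    # The Jacobian is triangular, so log|det J| is the sum of log-scales.
    return x_a + y_b, sum(log_scale)


def coupling_inverse(y, conditioner):
    """Invert coupling_forward using the untouched first half."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = conditioner(y_a)
    x_b = [(yb - t) * math.exp(-ls) for yb, ls, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


def flow_forward(x, conditioners):
    """Apply the blocks in order, reversing the vector after each block."""
    total_logdet = 0.0
    for cond in conditioners:
        x, logdet = coupling_forward(x, cond)
        total_logdet += logdet
        x = x[::-1]  # permutation, |det| = 1
    return x, total_logdet


def flow_inverse(y, conditioners):
    """Undo flow_forward by traversing the blocks in reverse order."""
    for cond in reversed(conditioners):
        y = coupling_inverse(y[::-1], cond)
    return y


# Four coupling blocks acting on a 4-D latent vector (toy sizes).
blocks = [make_conditioner([0.1 * (k + 1), -0.2 * (k + 1)], [0.3, -0.1 * k])
          for k in range(4)]
x = [0.3, -1.2, 0.7, 2.0]
z, logdet = flow_forward(x, blocks)
x_rec = flow_inverse(z, blocks)
```

Because each block leaves half of the vector unchanged, the inverse can recompute the same scale and shift, which is what makes the transform invertible regardless of how complex the conditioner is.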

6.2.2 Training settings and hyper-parameters

For the digits and CIFAR-10/STL-10 datasets, we implement adversarial training via alternating updates (Shu et al., 2018), and train the model using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of \({10}^{-3}\), decayed by a factor of 2 after 200 epochs.

For the Office-31 and VisDA-2017 datasets, we follow (Chen et al., 2020) and adopt all of their protocols, including the optimizer and learning rate strategy. We optimize the model using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and an adjusted learning rate \(\eta _p = \eta _0 (1+\alpha q)^{-\gamma }\), where \(\eta _0= 0.01\), \(\alpha = 10\), \(\gamma = 0.75\), and q is the training progress linearly decreasing from 1 to 0. Note that we set the learning rates of the classifier and discriminator to be 10 times that of the generator.
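The learning rate adjustment can be sketched as a small helper. The code assumes the inverse-power form \(\eta _0 (1+\alpha q)^{-\gamma }\) of Ganin et al. (2016), takes the progress value q as an input without fixing how it is scheduled, and uses hypothetical group names for the three sub-networks.

```python
def annealed_lr(q, eta0=0.01, alpha=10.0, gamma=0.75):
    """Base learning rate eta0 * (1 + alpha * q) ** (-gamma) (assumed form)."""
    return eta0 * (1.0 + alpha * q) ** (-gamma)


def per_module_lrs(q, multiplier=10.0):
    """Classifier and discriminator use 10x the generator learning rate."""
    base = annealed_lr(q)
    return {
        "generator": base,                    # hypothetical group names
        "classifier": multiplier * base,
        "discriminator": multiplier * base,
    }


lrs_start = per_module_lrs(0.0)
lrs_end = per_module_lrs(1.0)
```

With the stated constants, the base rate equals \(\eta _0\) at q = 0 and shrinks by a factor of \(11^{0.75}\) at q = 1.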

As for the hyper-parameters \((\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{s}, \lambda _{t})\), we tune the values for each dataset using cross-validation. We observed that extensive hyper-parameter tuning is not required to obtain top-performance results. Accordingly, we limit the hyper-parameter search for each task to \(\lambda _{sr} = \lambda _{tr} \in \{10 ^{-1}, 10^{-2}\}, \lambda _{kl} \in \{1, 10^{-1}\}, \lambda _{s} \in \{1\}, \lambda _{t} \in \{0, 1, 10^{-1}, 10^{-2}\}\).
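The resulting search space is small enough to enumerate exhaustively. The sketch below lists every configuration, under the assumption that \(\lambda _{sr}\) and \(\lambda _{tr}\) are tied to the same value; the dictionary keys are hypothetical names.

```python
from itertools import product

# Candidate values taken from the search ranges stated above.
LAMBDA_SR_TR = [1e-1, 1e-2]   # shared by lambda_sr and lambda_tr (assumed tied)
LAMBDA_KL = [1.0, 1e-1]
LAMBDA_S = [1.0]
LAMBDA_T = [0.0, 1.0, 1e-1, 1e-2]


def hyperparameter_grid():
    """Return every configuration in the search space as a dict."""
    grid = []
    for sr_tr, kl, s, t in product(LAMBDA_SR_TR, LAMBDA_KL, LAMBDA_S, LAMBDA_T):
        grid.append({
            "lambda_sr": sr_tr,
            "lambda_tr": sr_tr,
            "lambda_kl": kl,
            "lambda_s": s,
            "lambda_t": t,
        })
    return grid


configs = hyperparameter_grid()
```

Under this reading, the grid contains 2 × 2 × 1 × 4 = 16 configurations per task.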

6.3 Results

Table 1 Test accuracy (\(\%\)) on standard domain adaptation benchmarks

Table 1 summarizes the results of the average accuracy (\(\%\)) on the standard classification benchmarks for UDA, such as digits, CIFAR-10, and STL data sets, compared with SOTA methods. For fair comparison, we resize all images to \(32 \times 32 \times 3\) (except in case of adaptation from USPS to MNIST) and apply instance normalization (Shu et al., 2018) to input images. Below, we present a brief analysis of the results in Table 1.
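For reference, instance normalization of an input image can be sketched as below: each channel of a single image is shifted and scaled to zero mean and unit variance. The nested-list image layout (height × width × channels), the absence of learnable affine parameters, and the value of `eps` are assumptions made for this illustration.

```python
import math


def instance_normalize(image, eps=1e-5):
    """Normalize each channel of one H x W x C image independently."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    n = height * width
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for c in range(channels):
        values = [image[i][j][c] for i in range(height) for j in range(width)]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        inv_std = 1.0 / math.sqrt(var + eps)  # eps avoids division by zero
        for i in range(height):
            for j in range(width):
                out[i][j][c] = (image[i][j][c] - mean) * inv_std
    return out


# Toy 4 x 4 x 3 image whose channels have different scales.
toy = [[[float((i * 4 + j) * (c + 1)) for c in range(3)]
        for j in range(4)] for i in range(4)]
normalized = instance_normalize(toy)
```

Because the statistics are computed per image rather than per batch, the normalization removes image-specific contrast and brightness differences, which differ strongly between domains such as SVHN and MNIST.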

USPS\(\varvec{\rightarrow }\)MNIST: although USPS contains a smaller training set than MNIST, the domain discrepancy between these two datasets is relatively small, and we achieve high performance on USPS \(\rightarrow\) MNIST.

MNIST\(\varvec{\leftrightarrow }\)SVHN: for the adaptation task SVHN \(\rightarrow\) MNIST, we resize the MNIST images to the \(32 \times 32\) resolution of SVHN, with three channels. This adaptation problem is easily solved when the proposed MapFlow is applied: our method performs similarly to the SOTA DTA (Lee et al., 2019) on MNIST. The reverse problem, MNIST \(\rightarrow\) SVHN, can be regarded as the most challenging case among the digit datasets, as MNIST has a considerably lower dimensionality than SVHN. Experiments show that MapFlow achieves state-of-the-art results on this adaptation task. On average, MapFlow achieves a \(\mathbf {4.8\%}\) improvement over DIRT-T (Shu et al., 2018). This improvement shows the importance of the relaxed invariant representation.

CIFAR-10\(\varvec{\leftrightarrow }\)STL-10: in both adaptation directions, the results in Table 1 show that MapFlow is slightly better than the SOTA, which we believe is due to the relatively smaller training set of STL-10 and the existing imbalance between the two datasets.

Table 2 Test Accuracy (%) on Office-31 adaptation tasks for unsupervised domain adaptation (ResNet-50)

The results in Table 2 again show the superiority of our approach over other recently proposed methods on the Office-31 dataset. We evaluate MapFlow across six UDA tasks: \(\text {A} \rightarrow \text {W}\), \(\text {W} \rightarrow \text {D}\), \(\text {D} \rightarrow \text {W}\), \(\text {A} \rightarrow \text {D}\), \(\text {D} \rightarrow \text {A}\), and \(\text {W} \rightarrow \text {A}\). Our method surpasses the baselines in 3 out of the 6 adaptation tasks for Office-31. We further demonstrate the generalization ability of the proposed method by conducting additional experiments on VisDA-2017. In our experiments, we observed a gain of 0.6 points over the baseline (Chen et al., 2020), confirming the flexibility of MapFlow and its applicability across UDA tasks. The SOTA results with ResNet-50 are reported in Table 3.

Table 3 Test accuracy (%) on VisDA-2017 for unsupervised domain adaptation (ResNet-50)

6.4 Ablation studies

To examine the relative contribution of the invertible network in MapFlow, we conduct ablations on the adaptation tasks presented in Table 1, with and without the loss term of Eq. (13). The results are reported in Table 4, where the “no-nf” subscript denotes the removal of the normalizing flow (NF) component. We observe that when the loss including the log-determinant of the Jacobian is applied (MapFlow), our method demonstrates a significant improvement over MapFlow\(_{\mathrm {no-nf}}\) and previous works. These results demonstrate the effectiveness of the flow model in relaxing invariance enforcement.

Table 4 Test accuracy (\(\%\)) on standard domain adaptation benchmarks in ablation experiment

6.5 Analysis

6.5.1 Qualitative analysis

To further analyze the relaxed invariant representation, we use t-SNE (Van der Maaten & Hinton, 2008) to visualize the non-adapted and adapted feature representations generated from the last hidden layer of the model on the SVHN \(\rightarrow\) MNIST UDA task. As illustrated in Fig. 2, the source-only (non-adapted) model shows strong clustering of the SVHN samples but performs poorly on MNIST (Fig. 2a). MapFlow delivers higher feature discriminability in the target domain by keeping each class well separated without forcing the target clusters to be completely aligned with the source domain (Fig. 2c).

Fig. 2 t-SNE visualization of the last hidden layer for the SVHN \(\rightarrow\) MNIST task: (a) Non-adapted, (b) VADA model, (c) Adapted (MapFlow)

6.5.2 Target error bound

The learning theory of UDA was initially proposed by Ben-David et al. (2010) and is summarized in Theorem 1.

Theorem 1

(Ben-David et al., 2010) Let \({\mathcal {H}} = \{\varphi \circ g: \varphi \in \Phi ,g \in {\mathcal {G}} \}\) be the hypothesis space, where \({\mathcal {G}}\) and \(\Phi\) are the sets of representations and predictive functions, respectively. Let \(\varepsilon (h)\) be the risk for \(h \in {\mathcal {H}}\), and \(\varepsilon (h, h')\) be the risk for \((h, h') \in {\mathcal {H}}^2\). Then the target risk is bounded as follows:

$$\begin{aligned} \varepsilon _t(\varphi \circ g) \le \underbrace{\varepsilon _s(\varphi \circ g)}_{\textcircled {1}} + \underbrace{\frac{1}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}(p_{s}, p_{t})}_ {\textcircled {2}}+ \underbrace{\Psi (h)}_{\textcircled {3}}, \end{aligned}$$
(20)

where \(d_{{\mathcal {H}} \Delta {\mathcal {H}}}\) in the second term denotes \({\mathcal {H}} \Delta {\mathcal {H}}\) distance between source and target domains, \(\Psi (h)\) is the shared error of the ideal joint hypothesis, and

$$\begin{aligned} d_{{\mathcal {H}} \Delta {\mathcal {H}}}(p_{s}, p_{t}) = 2\sup \limits _{h, h'\in {\mathcal {H}}} \Big |\varepsilon _s(h, h')-\varepsilon _t(h, h')\Big |\end{aligned}$$
(21)
$$\begin{aligned} \Psi (h) = \inf \limits _{h\in {\mathcal {H}}} \varepsilon _s(h)+\varepsilon _t(h). \end{aligned}$$
(22)

We analyze the second term, domain discrepancy, and the third term, ideal joint hypothesis error, of the target error bound, as formulated in Eq. (20), on SVHN \(\rightarrow\) MNIST task.

Fig. 3 \({\mathcal {A}}\)-distance and \(\Psi\), evaluated on the SVHN \(\rightarrow\) MNIST task

Domain Discrepancy Domain discrepancy can be estimated approximately by the \({\mathcal {A}}\)-distance (Ben-David et al., 2010), defined as \({\mathcal {A}} = 2(1-2\epsilon )\), where \(\epsilon\) denotes the error of a domain classifier trained to discriminate between the source and target representations. As illustrated in Fig. 3a, MapFlow reduces the domain discrepancy more than standard domain adversarial training, but not as much as VADA (Shu et al., 2018), implying a relaxed invariance.
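The estimate \({\mathcal {A}} = 2(1-2\epsilon )\) is straightforward to compute once a domain classifier has been evaluated on held-out representations. The helper names and the toy domain labels below are hypothetical and only illustrate the arithmetic.

```python
def domain_classifier_error(predicted_domains, true_domains):
    """Fraction of samples whose domain (0 = source, 1 = target) is mispredicted."""
    if len(predicted_domains) != len(true_domains) or not true_domains:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(p != t for p, t in zip(predicted_domains, true_domains))
    return wrong / len(true_domains)


def proxy_a_distance(error):
    """A = 2 * (1 - 2 * error) for a domain-classifier error in [0, 1]."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)


# Toy example: 2 of 6 held-out samples are assigned to the wrong domain.
eps = domain_classifier_error([0, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1])
a_dist = proxy_a_distance(eps)
```

A domain classifier at chance level (\(\epsilon = 0.5\)) yields a distance of 0, meaning the representations are indistinguishable, while a perfect classifier (\(\epsilon = 0\)) yields the maximum value of 2.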

Ideal Joint Hypothesis We evaluate the error of the ideal joint hypothesis by training a two-layer MLP classifier on the adapted features from both the target and source domains, as suggested in (Chen et al., 2019). As shown in Fig. 3b, MapFlow reduces the joint error, which indicates that our method improves the feature discriminability.

7 Conclusion and future work

In this paper, we presented a novel relaxed invariant representation learning approach for unsupervised domain adaptation. In standard domain invariance learning, the transferability of feature representations is enhanced at the expense of their discriminability. We therefore proposed a general framework to relax invariance enforcement in the representation space. Our method encourages a combination of domain invariance and specificity to enhance target discriminability. The framework relies on a normalizing flow to learn a transformation between the distributions of the target and source domains in the representation space. Specifically, the normalizing flow maps a complex target latent distribution into a well-clustered source latent distribution through a sequence of invertible functions. We mathematically derived a variational lower bound for the probability distribution changing across domains and showed the consistency of the lower bound with the relaxed invariance assumption. In extensive experiments on several public UDA datasets, our approach outperforms other methods based on invariant representations, validating our analysis. In future work, we intend to extend our model to work in the presence of label and conditional shifts for domain adaptation.