MapFlow: latent transition via normalizing flow for unsupervised domain adaptation

Askari, Hossein; Latif, Yasir; Sun, Hongfu

doi:10.1007/s10994-023-06357-2

Open access
Published: 12 July 2023

Volume 112, pages 2953–2974, (2023)

Machine Learning


Abstract

Unsupervised domain adaptation (UDA) aims at enhancing the generalizability of the classification model learned from the labeled source domain to an unlabeled target domain. An established approach to UDA is to constrain the classifier on an intermediate representation that is distributionally invariant across domains. However, recent theoretical and empirical research has revealed that relying only on invariance fails to guarantee a small target error, thus making equality in the distribution of representations unnecessary. In this paper, we propose to relax invariant representation learning by finding a general relationship between the source and target representations, which allows an interchange of the more discriminative domain information. To this end, we formalize the MapFlow framework, which explicitly constructs an invertible mapping between the target encoded distribution and variationally induced source representation. Empirical results on public benchmark datasets show the desirable performance of our proposed algorithm compared to state-of-the-art methods.


1 Introduction

Deep learning (DL) is currently the most widespread and successful methodology in machine learning (Voulodimos et al., 2018; Litjens et al., 2017). The epitome of such success is the outstanding performance of DL models on image classification tasks (Tan & Le, 2019; Chu et al., 2021). The accuracy of deep classifiers on the ImageNet Large Scale Visual Recognition Challenge (Krizhevsky et al., 2012) has to date risen to $97\%$ (Tan & Le, 2019), even surpassing human-level performance. However, these classifiers perform poorly when tested on out-of-distribution data, which prevents them from being safely deployed in real-world settings. Recovering their performance then tends to require a massive amount of human and computational resources to annotate the test data. Unsupervised domain adaptation (UDA) seeks to ease the burden of the annotation process by transferring predictive models learned from the labeled training (source) domain to the unlabeled test (target) domain.

To tackle UDA, a broad spectrum of methods has been proposed (Sun & Saenko, 2016; Ganin et al., 2016; Cui et al., 2020; Liu & Tuzel, 2016; Murez et al., 2018; Hoffman et al., 2018; Saito et al., 2017; Sun et al., 2019; Lee et al., 2019; Liu et al., 2019, 2021). The prevailing approach is to train a classifier on the source domain while finding a relationship between the source and target domains, primarily by matching their distributions in the representation space (also known as domain-invariant representation). Such invariant representations have been achieved via matching distribution properties, such as statistical moments (Long et al., 2015; Sun & Saenko, 2016; Sun et al., 2017) and supports (Tong et al., 2022), or matching full distribution (Ganin et al., 2016; Saito et al., 2017; Usman et al., 2020; Nguyen et al., 2021; Courty et al., 2017). A notable, top-performing example of the latter is adversarial domain adaptation (Ganin & Lempitsky, 2014; Ganin et al., 2016; Saito et al., 2017), which yielded remarkable performance gains in UDA.

In the seminal work on the domain-adversarial neural network (DANN) (Ganin et al., 2016), a discriminator is trained to distinguish between the representations of the source and target domains, while a generator learns to deceive the discriminator by generating domain-invariant representations. Despite the impressive results that DANN achieved, it suffers from a critical restriction. The arbitrary transformation of the generator is prone to produce ambiguous target features that may even be specific to the source domain. Consequently, it may deteriorate the target feature discriminability, even though it enhances the feature transferability. A plethora of variants has been proposed to improve the discriminability of target features by (i) adjusting the classifier’s decision boundaries (Saito et al., 2018; Lee et al., 2019; Zhang et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020), (ii) regularizing the norm of the invariant representation (Chen et al., 2019; Xu et al., 2019; Jin et al., 2020) or conditional prediction features (Cui et al., 2020; Jin et al., 2020), (iii) tackling the mode collapse issue (Long et al., 2018; Pei et al., 2018), (iv) encouraging task-related distribution matching (Wei et al., 2021; Jin et al., 2020), and (v) utilizing domain-specific variations when separating the domain-invariant representations (Bousmalis et al., 2016; Cui et al., 2020; Gong et al., 2019).

Nevertheless, all the advances mentioned above disregard domain-specific characteristics and rely only on the invariant features, which tends to be insufficient for a well-performing classifier. Each domain often has unique variations that can contribute significantly to in-domain classification performance when leveraged.

Based on this observation, in this paper, we propose to relax invariance enforcement, a significant cause for inadaptability (Bouvier et al., 2019), by exploiting both domain-specific and invariant knowledge in capturing the interrelation between source and target domains in the representation space. Accounting for the fact that a representation space may be more suitable for the target domain than it is for the source domain, we aim to find the relationship between source and target representations by learning a transformation from source to target domain in the feature space. Hence, we propose MapFlow, a general framework to relax domain invariance. MapFlow framework (MFF) relies on normalizing flow to learn a bijective, non-linear transformation between the encoded target distribution and a flexible latent prior induced directly from the source latent space by variational inference.

MFF enables us to explicitly model target latent knowledge by efficiently regularizing the log determinant of the Jacobian. Maximizing the determinant of the Jacobian helps to alleviate the distributional divergence by establishing a geometrical relationship between the source and target representations. Explicit latent distribution modeling has been explored for UDA (Liu et al., 2017; Grover et al., 2019; Zhu et al., 2019), where the source and target latent distributions are modeled as predefined parametric distributions. In contrast to those methods, we employ a normalizing flow (NF), a specific type of invertible neural network (INN) with an easily computable determinant of the Jacobian, to model the likelihood of the complex target latent distribution. In addition, unlike adversarial domain adaptation in the representation space, which may fail to achieve multimodal alignment, MFF can preserve the multimodal structure of the target latent space, which is suitable for discriminative mapping or alignment.

The contribution of our work is as follows. First, we present a motivating scenario to relax the excessive invariance in representation learning for UDA and propose a relaxed-invariant objective in representation learning that overcomes the limitations of standard objectives. In particular, from a probabilistic perspective, we mathematically derive a lower bound on the joint probability distribution of the source and target domains as a unified framework and general objective for UDA called MapFlow. MapFlow enables us to (1) exploit a more complex distribution for the target domain for which we can model the density when the source latent distribution is known and (2) leverage the relationship between the two domains rather than enforcing them to follow a simple and strict constraint (e.g., to be Gaussian distributed). Second, we empirically show that our proposed MapFlow loss improves the performance for the discriminability of the target domain.

The rest of the paper is organized as follows: the related work is detailed in Sect. 2. The preliminary concepts are presented in Sect. 3. The motivational insight is investigated in Sect. 4. The proposed approach is discussed in Sect. 5, followed by an experimental evaluation of image classification performance in Sect. 6. Sect. 7 concludes the paper.

2 Related works

2.1 Unsupervised domain adaptation (UDA)

The success of supervised machine learning relies on the availability of a large amount of annotated training data from different domains, which is often costly to collect and unrealistic in many cases. Unsupervised domain adaptation (UDA) aims to overcome this problem by transferring discriminative features extracted from the label-abundant source domain to the unlabelled target domain. A variety of methods has been proposed in the literature to attain adaptation. Apart from improvements in architecture designs (Li et al., 2016; Maria Carlucci et al., 2017; Wang et al., 2019, 2020; Xu et al., 2021) and optimization strategies (Wei et al., 2021; Acuna et al., 2022; Rangwani et al., 2022), these methods can be generally divided into three categories according to the fundamental questions of what to adapt, the data or the model (Liang et al., 2020; Huang et al., 2021; Kundu et al., 2022); when to adapt, during training or testing (Wang et al., 2020; Gao et al., 2022); and how to adapt, by learning a domain-invariant representation (Sun & Saenko, 2016; Ganin et al., 2016) or by domain mapping (Liu & Tuzel, 2016; Liu et al., 2017; Murez et al., 2018; Gong et al., 2019; Hoffman et al., 2018).

The difficulty in UDA is how to resolve the distributional shift between the source and target domains, which is mathematically characterized by the difference in joint probability distribution $p_t({\textbf{x}}, {\textbf{y}}) \ne p_s({\textbf{x}}, {\textbf{y}})$. The UDA problem is generally infeasible unless we make some assumptions as to how the test distribution may alter. One of the most common assumptions is covariate shift, which assumes that the distributional shift is merely caused by inconsistency in the feature space, i.e., $p_t ({\textbf{x}}) \ne p_s ({\textbf{x}})$. Importance sampling is employed by Shimodaira (2000) to bridge the distributional gap via a weighting mechanism $\textstyle {w({\textbf{x}}) = \frac{p_t({\textbf{x}})}{p_s({\textbf{x}})}}$. However, the shift between two domains with high-dimensional data, such as images, stems from non-overlapping supports, thus requiring unbounded weights. Ben-David et al. (2007) showed theoretically that the non-overlapping supports can be reconciled by learning a representation that exhibits invariance across domains, leading to numerous algorithms for UDA (Sun & Saenko, 2016; Sun et al., 2017; Ganin et al., 2016; Saito et al., 2017; Usman et al., 2020; Saito et al., 2018; Lee et al., 2019; Zhang et al., 2019; Shu et al., 2018; Chen et al., 2020; Jiang et al., 2020; Chen et al., 2019; Xu et al., 2019; Cui et al., 2020; Jin et al., 2020; Long et al., 2018; Pei et al., 2018; Wei et al., 2021; Bousmalis et al., 2016; Gong et al., 2019). This approach increases the transferability of features, since highly transferable features are close to an invariant representation, while low transferability implies more domain-specific features. However, invariance learning may not only substantially deteriorate adaptability, as has been proved (Wu et al., 2019; Johansson et al., 2019; Zhao et al., 2019; Arjovsky et al., 2019; Bouvier et al., 2019), but may also neglect domain-specific information that can be highly beneficial for target feature discriminability. Therefore, in this paper, we aim to relax restrictive invariance by preserving domain-specific properties enforced by reconstruction.
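To make the importance-weighting idea concrete, the following is a minimal sketch rather than anything taken from the paper. It assumes, purely for illustration, that the source and target input densities are the one-dimensional Gaussians $p_s = {\mathcal {N}}(0, 1)$ and $p_t = {\mathcal {N}}(1, 1)$. The function names `normal_pdf` and `importance_weight` are hypothetical. The sketch reweights source samples by $w(x) = p_t(x)/p_s(x)$ to estimate a target expectation, and shows how the weights blow up where the source has little mass.

```python
import math
import random

def normal_pdf(x, mean, std=1.0):
    """Density of a univariate Gaussian at x."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def importance_weight(x, mean_s=0.0, mean_t=1.0):
    """w(x) = p_t(x) / p_s(x) for two unit-variance Gaussians (an assumption)."""
    return normal_pdf(x, mean_t) / normal_pdf(x, mean_s)

rng = random.Random(0)
source_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Estimate the target mean E_t[x] using only source samples and their weights.
estimate = sum(importance_weight(x) * x for x in source_samples) / len(source_samples)
print(estimate)  # should land near the target mean of 1.0

# The weight grows exponentially in x here: w(5) = exp(4.5).
print(importance_weight(5.0))
```

With overlapping Gaussians the estimate is usable, but the exponential growth of $w(x)$ in this toy case hints at why weights become unbounded when the supports barely overlap.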

2.2 Normalizing flow

A normalizing flow (NF) (Papamakarios et al., 2021) learns to transform an unknown, complex distribution to a simple distribution by a well-designed invertible network. NF models have been applied to several machine learning tasks, including image generation (Dinh et al., 2016; Kingma & Dhariwal, 2018), semi-supervised learning (Izmailov et al., 2020), inverse problems (Ardizzone et al., 2018), distribution matching (Usman et al., 2020), and domain adaptation (Gong et al., 2019; Grover et al., 2019; Sagawa & Hino, 2022). For example, log-likelihood ratio minimizing flows (LRMF) (Usman et al., 2020) leverage invertible flow networks and density estimation for distribution matching without adversarial training and define a new metric based on the log-likelihood ratio. The density model is not fixed and is trained to fit the mixture $\textstyle {\frac{1}{2} p({{{\textbf{z}}}_s}) + \frac{1}{2} p({{{\textbf{x}}}_t})}$. As for domain adaptation, Gong et al. (2019) proposed domain flow (DLOW) to generate multiple intermediate domains along the data manifolds between the source and target domains using normalizing flow to reduce the domain shift. Sagawa and Hino (2022) used NF to generate nonadjacent intermediate domains between the source and target domains to solve UDA based on a gradual self-training idea (Kumar et al., 2020). AlignFlow (Grover et al., 2019) trains two normalizing flows separately to map the source and the target domain to a common latent space with a Gaussian distribution and employs adversarial discriminators to execute further distribution alignment. In contrast, in this paper, we use an NF model to transform the discriminative source distribution to the target one in the representation space.

3 Preliminaries

3.1 Notation and problem definition

Let ${\mathcal {X}}$ and ${\mathcal {Y}}$ be the input and output space, respectively. ${\mathcal {Z}}$ is the representation space generated from ${\mathcal {X}}$ by a feature transformation $g:{\mathcal {X}} \rightarrow {\mathcal {Z}}$. Accordingly, we use $X$, $Y$, $Z$ as random variables from spaces ${\mathcal {X}}$, ${\mathcal {Y}}$, ${\mathcal {Z}}$, and let lower-case variables ${\textbf{x}}$, ${\textbf{y}}$, and ${\textbf{z}}$ denote the corresponding sample values, respectively. We also define an output labeling function $\varphi : {\mathcal {Z}} \rightarrow {\mathcal {Y}}$ and a composite predictive transformation $\varphi \circ g$. Given $n_s$ labeled samples of the source domain $\{({\textbf{x}}_i,{\textbf{y}}_i) \mid {\textbf{x}}_i \in {{\mathcal {X}}}_s, {\textbf{y}}_i\in {\mathcal {Y}}_s, i= 1, 2, \ldots , n_s\}$, with $({\textbf{x}}, {\textbf{y}}) \sim p_{s}(X, Y)$, and unlabelled samples of the target domain $\{({\textbf{x}}_i) \mid {\textbf{x}}_i\in {\mathcal {X}}_t, i= 1, 2, \ldots , n_t\}$, with ${\textbf{x}}\sim p_{t}(X)$, UDA aims to transfer the predictive knowledge learned from the source domain to the target domain.

3.2 Normalizing flow for transformation

The normalizing flow (Dinh et al., 2016; Kingma & Dhariwal, 2018) is a likelihood-based generative model defined as an invertible mapping $F:{\mathcal {X}} \rightarrow {\mathcal {Z}}$ from the observed space ${\mathcal {X}}$ to the latent space ${\mathcal {Z}}$. The distribution of the observed variable can be modeled by applying a chain of invertible transformations, which is composed of a sequence of invertible functions $f = f_1 \circ f_2 \circ \ldots \circ f_L:\mathbb R^{d} \rightarrow {\mathbb {R}}^{d}$ with inverse $F = f^{-1}$, on random latent variables with known distribution ${\textbf{z}}\sim p(Z)$. Based on the change of variables formula, the probability distribution of the transformed random variable can be written as follows:

$$\begin{aligned} \begin{aligned} p_{X}({{\textbf{x}}})&= \;p_{Z}(f^{-1}({\textbf{x}})) \;\Big |\det (J_{f^{-1}}({\textbf{x}})) \Big |=\; p_{Z}(f^{-1}({\textbf{x}})) \prod _{l=1}^{L} \;\Big |\det (J_{{f_{l}}^{-1}}({\textbf{h}}_l)) \Big |, \end{aligned} \end{aligned}$$

(1)

where $J_{f^{-1}}({\textbf{x}}) = {\partial f^{-1}({\textbf{x}})}/{\partial {\textbf{x}}}$ is the Jacobian of $f^{-1}$ with respect to ${\textbf{x}}$, det($\cdot$) denotes the determinant, and ${\textbf{h}}_l$ denotes the output of intermediate mapping $f_{l}$, with ${\textbf{h}}_1={\textbf{x}}$ and ${\textbf{h}}_L=f_L({\textbf{z}})$. The mapping $F({\textbf{x}})$ is characterized by a neural network with an architecture designed to ensure invertibility and efficient computation of determinants. We train the model by minimizing the negative log-likelihood of the training data $D=\{{\textbf{x}}_i\}_{i=1}^{N}$ with respect to the parameters ${\eta }$:

$$\begin{aligned} {\eta }^{*} = \mathop {\textrm{argmin}}\limits _{\eta } {\mathcal {L}}, \; {\mathcal {L}} = - \frac{1}{|D |} \sum _{{\textbf{x}}\in D} \log p({\textbf{x}}; \eta ). \end{aligned}$$

(2)
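The change of variables formula in Eq. (1) and the likelihood objective in Eq. (2) can be illustrated with a deliberately tiny example. The sketch below assumes a one-dimensional, single-layer flow $x = f(z) = a z + b$ with a standard normal prior on $z$. The flow, the data values, and the function names are hypothetical choices for illustration, not the architecture used in the paper.

```python
import math

def std_normal_logpdf(z):
    """Log-density of a standard normal at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_logpdf(x, a, b):
    """log p_X(x) for x = f(z) = a*z + b with z ~ N(0, 1), following Eq. (1)."""
    z = (x - b) / a                   # f^{-1}(x)
    log_abs_det = -math.log(abs(a))   # log |d f^{-1}(x) / dx|
    return std_normal_logpdf(z) + log_abs_det

def nll(data, a, b):
    """Average negative log-likelihood over the data set, as in Eq. (2)."""
    return -sum(flow_logpdf(x, a, b) for x in data) / len(data)

data = [1.8, 2.1, 2.4, 1.9, 2.3]
nll_identity = nll(data, a=1.0, b=0.0)   # flow that ignores the data
nll_fitted = nll(data, a=0.25, b=2.1)    # flow centred on the data
print(nll_identity, nll_fitted)
```

The parameters that place the transformed prior on top of the data give the lower negative log-likelihood, which is the direction a gradient-based optimizer would move $\eta = (a, b)$ in this toy setting.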

The affine coupling layer is a powerful reversible transformation introduced in Dinh et al. (2016). Following Dinh et al. (2016), the $D$-dimensional input data ${\textbf{z}}$ is partitioned into two vectors ${\textbf{z}}_1 = {\textbf{z}}_{1:d}$ and ${\textbf{z}}_2 = {{\textbf{z}}}_{d+1:D}$ with $d < D$. The output of one affine coupling layer is given by ${\textbf{y}}_1 = {\textbf{z}}_1$, ${\textbf{y}}_2 = {\textbf{z}}_2 \odot \exp (s({\textbf{z}}_1)) + t({\textbf{z}}_1)$, where $s$ and $t$ represent functions from ${\mathbb {R}}^d \rightarrow {\mathbb {R}}^{D-d}$ and $\odot$ is the Hadamard product. The inverse of the transformation is given by ${\textbf{z}}_1 = {\textbf{y}}_1$, ${\textbf{z}}_2 = ({{{\textbf{y}}}_2} - t({{{\textbf{y}}}_1})) \odot \exp (-s({\textbf{y}}_1))$. Since $s({\textbf{z}}_1)$ has $D-d$ entries, the determinant of the Jacobian matrix of this transformation is explicitly derived as $\textstyle {\det \frac{\partial {{\textbf{y}}}}{\partial {{\textbf{z}}}}=\prod _{j=1}^{D-d} \exp [s({\textbf{z}}_1)_j]}$.
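A single affine coupling layer can be sketched in a few lines. In the example below, `s_fn` and `t_fn` are hypothetical hand-written stand-ins for the neural networks $s$ and $t$, and the dimensions ($D = 3$, $d = 1$) are chosen only for illustration. The log-determinant is computed as the sum of the entries of $s({\textbf{z}}_1)$, which is the logarithm of the product of the $\exp [s({\textbf{z}}_1)_j]$ factors.

```python
import math

def s_fn(z1):
    """Toy scale function with one output per entry of z2 (here D - d = 2)."""
    return [0.5 * z1[0], -0.3 * z1[0]]

def t_fn(z1):
    """Toy translation function with the same output size as s_fn."""
    return [z1[0] + 1.0, 2.0 * z1[0]]

def coupling_forward(z, d):
    """y1 = z1, y2 = z2 * exp(s(z1)) + t(z1); also returns log|det dy/dz|."""
    z1, z2 = z[:d], z[d:]
    s, t = s_fn(z1), t_fn(z1)
    y2 = [z2j * math.exp(sj) + tj for z2j, sj, tj in zip(z2, s, t)]
    log_det = sum(s)  # log of the product of exp(s_j)
    return z1 + y2, log_det

def coupling_inverse(y, d):
    """z1 = y1, z2 = (y2 - t(y1)) * exp(-s(y1))."""
    y1, y2 = y[:d], y[d:]
    s, t = s_fn(y1), t_fn(y1)
    z2 = [(y2j - tj) * math.exp(-sj) for y2j, sj, tj in zip(y2, s, t)]
    return y1 + z2

z = [0.4, -1.0, 2.0]
y, log_det = coupling_forward(z, d=1)
z_rec = coupling_inverse(y, d=1)
print(y, log_det, z_rec)
```

Because ${\textbf{z}}_1$ passes through unchanged, the inverse can recompute $s$ and $t$ from ${\textbf{y}}_1$ without ever inverting those functions, which is why they may be arbitrary networks.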

4 Motivational insight

In this section, we motivate our approach by highlighting a key issue with invariant representation learning. Consider the error of a predictor $\varphi$ with respect to the true labelling function $\psi$ under distribution ${\mathcal {D}}$ with joint probability distribution $p({\textbf{x}},{\textbf{y}})$, defined as $\textstyle {\varepsilon (g,\varphi ):= {\mathbb {E}}_{{\textbf{x}}\sim {\mathcal {D}}} \big [\big |\varphi (g({\textbf{x}})) - \psi ({\textbf{x}})\big |\big ]}$. Writing $h = \varphi \circ g$ for the composite predictor, for the target domain we have:

$$\begin{aligned} \varepsilon _t(h) = \int p_t({\textbf{x}}) \big |\varphi (g({\textbf{x}}))-\psi ({\textbf{x}}) \big |d{\textbf{x}}, \end{aligned}$$

(3)

where $r({\textbf{x}}) = \big |\varphi (g({\textbf{x}}))-\psi ({\textbf{x}}) \big |$ is the risk for input ${\textbf{x}}$. Following the change of variable rule ($p({\textbf{x}})\,d{\textbf{x}} = p({\textbf{z}})\,d{\textbf{z}}$), we then have

$$\begin{aligned} \varepsilon _{t} (h) = \int p_{t}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{t}({\textbf{z}})\mid d{\textbf{z}}= \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}. \end{aligned}$$

(4)

Similar to the proof presented by Ben-David et al. (2010), $\varepsilon _{t}(h)$ can be rewritten as follows:

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \varepsilon _{t} (h) + \varepsilon _{s} (h) - \varepsilon _{s} (h) \\&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{t}({\textbf{z}})\mid d{\textbf{z}}\\ {}&- \int p_{s}({\textbf{z}}) \mid \varphi ({\textbf{z}}) - \psi _{s}({\textbf{z}})\mid d{\textbf{z}}\\&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}- \int p_{s}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}. \\ \end{aligned} \end{aligned}$$

(5)

Adding and subtracting $\int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}$ on the right-hand side of Eq. (5) gives

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \varepsilon _{s} (h) + \int p_{t}({\textbf{z}}) r_{t}({\textbf{z}}) d{\textbf{z}}- \int p_{s}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}\\&\qquad + \int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}- \int p_{t}({\textbf{z}}) r_{s}({\textbf{z}}) d{\textbf{z}}, \end{aligned} \end{aligned}$$

(6)

then we have:

$$\begin{aligned} \begin{aligned} \varepsilon _{t} (h)&= \underbrace{\varepsilon _{s} (h)}_{\textcircled {1}} + \underbrace{\int p_{t}({\textbf{z}}) (r_{t}({\textbf{z}}) - r_{s}({\textbf{z}}))d{\textbf{z}}}_{\textcircled {2}} + \underbrace{\int (p_{t}({\textbf{z}})-p_{s}({\textbf{z}})) r_{s}({\textbf{z}})d{\textbf{z}}}_{\textcircled {3}}. \end{aligned} \end{aligned}$$

(7)

The third term in Eq. (7) is zero when $p_{t}({\textbf{z}})=p_{s}({\textbf{z}})$, and the second term can become zero when the labeling function on the representation space remains fixed between the source and target domains. Indeed, we have $r_{t}({\textbf{z}})-r_{s}({\textbf{z}}) = \mid \varphi ({\textbf{z}})-\psi _{t}({\textbf{z}}) \mid -\mid \varphi ({\textbf{z}})-\psi _{s}({\textbf{z}}) \mid \le \mid \psi _{t}({\textbf{z}})-\psi _{s}({\textbf{z}}) \mid$. However, as we do not have labels for the target domain, we have no control over the second term. Wu et al. (2019) studied an upper bound on the third term of Eq. (7), as follows:

$$\begin{aligned} \begin{aligned} \textcircled {3}&= \int {\bigg ({\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}- 1 \bigg ) p_{s}({\textbf{z}}) r_{s}({\textbf{z}})d{\textbf{z}}} \le \bigg (\sup _{{\textbf{z}}\in {\mathcal {Z}}}{\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}- 1 \bigg ) \varepsilon _{s} (h). \end{aligned} \end{aligned}$$

(8)

This upper bound shows that if $\varepsilon _{s} (h) = 0$, then the condition $p_{t}({\textbf{z}}) = p_{s}({\textbf{z}})$ is no longer needed to make the third term in Eq. (7) equal to zero. Note that in domain-invariant representation learning, we assume that the ratio $\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})}}$ is equal to 1. Enforcing this equality ($\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})} = 1}$) is therefore stricter than necessary and may deteriorate the adaptability. As a result, we suggest relaxing this equality to $\textstyle {\frac{p_{t}({\textbf{z}})}{p_{s}({\textbf{z}})} = \;|\det (J_{{f}^{-1}}({\textbf{z}}_s)) |}$, which helps to transfer the cross-domain knowledge between the latent spaces without any loss of information.
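The decomposition in Eq. (7) and the bound in Eq. (8) can be checked numerically on a toy problem. The sketch below replaces the integrals with sums over a latent space of three discrete values. The densities and pointwise risks are made-up numbers chosen only for illustration.

```python
# Discrete latent space with three values of z (all numbers are hypothetical).
p_s = [0.5, 0.3, 0.2]   # source latent distribution p_s(z)
p_t = [0.2, 0.3, 0.5]   # target latent distribution p_t(z)
r_s = [0.1, 0.0, 0.2]   # pointwise source risk r_s(z)
r_t = [0.1, 0.3, 0.2]   # pointwise target risk r_t(z)

# Source and target errors as expectations of the pointwise risks.
eps_s = sum(p * r for p, r in zip(p_s, r_s))
eps_t = sum(p * r for p, r in zip(p_t, r_t))

# Terms 2 and 3 of Eq. (7).
term2 = sum(pt * (rt - rs) for pt, rt, rs in zip(p_t, r_t, r_s))
term3 = sum((pt - ps) * rs for pt, ps, rs in zip(p_t, p_s, r_s))

# Upper bound on term 3 from Eq. (8).
bound3 = (max(pt / ps for pt, ps in zip(p_t, p_s)) - 1.0) * eps_s

print(eps_t, eps_s + term2 + term3)  # the two sides of Eq. (7)
print(term3, bound3)                 # term 3 and its bound from Eq. (8)
```

In this example the latent distributions differ substantially, yet the third term stays small because the source risk is small, which is the observation motivating the relaxation of strict invariance.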

5 The MapFlow framework

The learning of a joint distribution of source and target data has been studied for domain adaptation (Long et al., 2013; Liu et al., 2017; Courty et al., 2017; Damodaran et al., 2018). However, these methods assume shared latent space or cycle-consistency, which are both rather restrictive, as they impose strict constraints while modeling complex distributions in the latent space (Bouvier et al., 2019; Johansson et al., 2019). To tackle this, a general framework is presented to infer the joint distribution from the marginal ones without any additional assumption on the structure of the joint distribution. In this framework, we generalize the relationship between source and target representations by using an invertible neural network, through which the distribution of the target representation can be modeled without enforcing a strict constraint. We formulate the lower bound on the joint probability distribution over input spaces, which can be leveraged for the following multi-task learning objectives: 1) image translation between two domains, 2) sampling, and 3) classification.

5.1 Framework for joint distribution

We define a joint distribution over image samples and associated labels on the source and target domains as $p_{\tau }({\textbf{x}}_t, {\textbf{x}}_s, {\textbf{y}}_t, {\textbf{y}}_s)$. Assuming the conditional independence between ${\textbf{y}}_t$ and ${\textbf{x}}_s$ given ${\textbf{x}}_t$, and also the conditional independence between ${\textbf{y}}_s$ and ${\textbf{x}}_t$ given ${\textbf{x}}_s$, the joint distribution can be factorized under the chain rule as follows:

$$\begin{aligned} p_{\tau }({\textbf{x}}_t, {\textbf{x}}_s,{\textbf{y}}_t, {\textbf{y}}_s) = p_{\gamma }({\textbf{x}}_t, {\textbf{x}}_s) p_{\beta }({\textbf{y}}_s \vert {\textbf{x}}_s) p_{\alpha }({\textbf{y}}_t \vert {\textbf{y}}_s, {\textbf{x}}_t), \end{aligned}$$

(9)

where $\tau = \{\gamma , \beta , \alpha \}$ represents the model parameters. The third term in Eq. (9) can be interpreted as the probability of the model on target samples, the second term is the classification model on source samples, and the first term is the joint probability distribution over data samples, which can be defined as follows, by considering ${\textbf{z}}_t$ and ${\textbf{z}}_s$ as the latent variables to model the source and target distributions:

$$\begin{aligned} p_{{\gamma }}({\textbf{x}}_t,{\textbf{x}}_s) = \int p_{{\theta }}({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)p_{{\eta }}({\textbf{z}}_t,{\textbf{z}}_s)d{\textbf{z}}_t d{\textbf{z}}_s, \end{aligned}$$

(10)

where maximizing the likelihood of such a joint distribution is generally intractable. Thus, we leverage variational inference to model the joint distribution. Assuming a joint variational posterior $q_{\phi }({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)$, the joint evidence lower bound (ELBO) can be derived as follows:

$$\begin{aligned} \begin{aligned} \log p_{{\gamma }}({\textbf{x}}_t,{\textbf{x}}_s)&\ge {\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)}\big [\log p_{{\theta }}({\textbf{x}}_t, {\textbf{x}}_s\vert {\textbf{z}}_t, {\textbf{z}}_s)\big ]\\&+{\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)} \big [\log p_{{\eta }}({\textbf{z}}_t, {\textbf{z}}_s)\big ]\\&- {\mathbb {E}}_{q_{{\phi }}({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)} \big [\log q_{{\phi }}({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t, {\textbf{x}}_s)\big ], \end{aligned} \end{aligned}$$

(11)

where the first expectation term is a reconstruction error, the second one refers to the joint prior distribution, and the third expectation term is the entropy of the variational posterior. The reconstruction term can be factorized as $p_{{\theta }}({\textbf{x}}_t, {\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s) = p_{{\theta _t}}({\textbf{x}}_t\vert {\textbf{z}}_t, {\textbf{z}}_s) p_{{\theta _s}}({\textbf{x}}_s\vert {\textbf{z}}_s)$ by assuming the conditional independence between ${\textbf{x}}_t$ and ${\textbf{x}}_s$ given ${\textbf{z}}_t$. To simplify the third term on the right-hand side (RHS) of Eq. (11), we formulate a factorized variational posterior of the form $q_{{\phi }}({\textbf{z}}_t, {\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s) = q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t) q_{{\phi }_s}({\textbf{z}}_s\vert {\textbf{x}}_s)$, which is consistent with the conditional independence assumption between the latent space of one domain and the input space of the other. Also, we define ${\textbf{z}}_t = f({\textbf{z}}_s)$, which leads to the factorization of the joint prior as $p_{\eta }({\textbf{z}}_t, {\textbf{z}}_s) = p_{\eta }({\textbf{z}}_t\vert {\textbf{z}}_s) p_{\eta }({\textbf{z}}_s)$. Taking all these terms into account and using the chain rule along with Eq. (11), we can derive the final ELBO loss as follows:

$$\begin{aligned} \begin{aligned} {{\mathcal L}_{\gamma }}(\theta _s, \theta _t, \phi _{s}, \phi _{t}, \eta )&= \lambda _{tr}\underbrace{{\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)}\bigg [{\mathbb {E}}_{p_{\eta }({\textbf{z}}_t\vert {\textbf{z}}_s)}\big [\log p_{\theta _t}({\textbf{x}}_t\vert {\textbf{z}}_t)\big ]\bigg ]}_{({\mathcal {L}}_{tr})}\\&+ \lambda _{sr}\underbrace{{\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log p_{\theta _s}({\textbf{x}}_s\vert {\textbf{z}}_s)\big ]}_{({\mathcal {L}}_{sr})} \\&+ \lambda _{kl}\underbrace{{\mathbb {E}}_{q_{\phi _{s}}({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log p_{\eta }({\textbf{z}}_s)-\log q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)\big ]}_{({\mathcal {L}}_{kl})} \\&- \lambda _{f}\underbrace{{\mathbb {E}}_{q_{\phi _t}({\textbf{z}}_t\vert {\textbf{x}}_t)}\bigg [\log p(f^{-1}({\textbf{z}}_t))-\log \Big |\det \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}\Big |\bigg ]}_{({\mathcal {L}}_f)}, \end{aligned} \end{aligned}$$

(12)

where $\lambda = (\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{f})$ are regularization parameters. Further details about the mathematical derivation of this loss can be found in Appendix A. An illustration of our general framework is provided in Fig. 1. It consists of one feature extractor (encoder) ${g}_{s}({\textbf{x}}_s; \phi _{s})$ to learn the posterior distribution for the source domain. We rely on variational inference (VI) to find an approximation ${g}_{s}({\textbf{x}}_s; \phi _{s}) = q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)$ for the true latent posterior distribution $p_{\theta } ({\textbf{z}}_s\vert {\textbf{x}}_s)$, which is parameterized by a deep neural network with parameters $\phi _{s}$. Therefore, the representation space of the source domain is forced to be Gaussian with distribution $\textstyle {\displaystyle {\mathcal {N}} ({\textbf{z}}_s \vert \mu _{\phi _s}({\textbf{x}}_s), \sigma ^2_{\phi _s}({\textbf{x}}_s))}$, which can be used as a prior to model the target representation.
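The source-side term ${\mathcal {L}}_{kl}$ in Eq. (12) has a closed form under common assumptions. The sketch below assumes a diagonal Gaussian posterior $q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)$ and, as a simplification, a fixed standard normal prior $p_{\eta }({\textbf{z}}_s)$, in which case ${\mathcal {L}}_{kl}$ equals the negative Kullback-Leibler divergence from the posterior to the prior. The encoder outputs `mu` and `sigma` are hypothetical values, and the function names are illustrative only.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z_s = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def l_kl(mu, sigma):
    """E_q[log p(z_s) - log q(z_s | x_s)], i.e. minus the KL divergence."""
    return -kl_to_std_normal(mu, sigma)

rng = random.Random(0)
mu, sigma = [1.0, 0.0], [1.0, 1.0]   # hypothetical encoder outputs
z_s = reparameterize(mu, sigma, rng)
print(z_s, l_kl(mu, sigma))
```

Under these assumptions the term is maximized, at zero, exactly when the posterior matches the prior. A learned prior would require a Monte Carlo estimate in place of the closed form.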

For the target domain, on the other hand, an invertible neural network constructed from affine coupling layers, which makes the Jacobian $\textstyle {J = \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}}$ easy to compute, is utilized to estimate the density of target encoded samples $\textstyle {{g}_{t}({\textbf{x}}_t; {\phi }_{t}) = q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t)}$.

Let ${\textbf{z}}_s$ with dimension d be the encoded latent variable for unit Gaussian distribution $p({\textbf{z}}_s)$ and let ${\textbf{z}}_t\in {\mathcal {Z}}_t$ be an observation from an unknown target distribution ${\textbf{z}}_t \sim p({\textbf{z}}_t)$. Given $f_{{\eta }}: {\textbf{z}}_s \rightarrow {\textbf{z}}_t$, we define a model $p_{{\eta }}({\textbf{z}}_t)$ with parameters $\eta$ on ${\mathcal {Z}}_t$, and we can compute the negative log-likelihood (NLL) of ${\textbf{z}}_t$ by the change of variable formula. For a single unlabeled target datapoint, the unsupervised objective can be derived as follows:

$$\begin{aligned} \log p_{\eta }({\textbf{z}}_t) = {\mathcal {L}}_f(f_{\eta }({\textbf{z}}_t)) = \log p_{\eta }(f_{\eta }^{-1}({\textbf{z}}_t)) + \log \Big |\det \left( \frac{\partial f_{\eta }^{-1}({\textbf{z}}_t)}{\partial {\textbf{z}}_t}\right) \Big |, \end{aligned}$$

(13)

where $p_{{\eta }}$ is the prior distribution for the source domain. Maximizing this log-likelihood, or equivalently minimizing the corresponding NLL, helps to generate a mapping of each unlabeled target sample into the corresponding embedding space.

Note that ${{\mathcal L}_{{\gamma }}}$ in Eq. (12) has five terms, including target reconstruction, source reconstruction, a prior term for source domain, which can be learned with another invertible network, the entropy of source dataset, and a mapping objective from target to source. In our method, the transfer properties are enforced by reconstructing and translating input images. The range of the source representation part has been restrained to be Gaussian.

The second term in Eq. (9) is a predictive function on source datasets. Assuming that $p_{\theta }({\textbf{z}}_s\vert {\textbf{x}}_s)$ can be approximated by the variational posterior $q_{\phi _s}({\textbf{z}}_s\vert {\textbf{x}}_s)$, we have:

$$\begin{aligned} \begin{aligned} p_{\beta }({\textbf{y}}_s \vert {\textbf{x}}_s)&= \int p_{\omega }({\textbf{y}}_s \vert {\textbf{z}}_s)p_{\theta }({\textbf{z}}_s\vert {\textbf{x}}_s) d{\textbf{z}}_s \approx {\mathbb {E}}_{q_{\phi _s}({\textbf{z}}_s \vert {\textbf{x}}_s)}[p_{\omega }({\textbf{y}}_s \vert {\textbf{z}}_s)]. \end{aligned} \end{aligned}$$

(14)

The predictive function ${\varphi }_{\omega }:{\mathcal {Z}}_s \rightarrow {\mathcal {Y}}_s$ enforces separability between classes,

$$\begin{aligned} {\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s) =- {\mathbb {E}}_{{\textbf{z}}_s \sim q_{{\phi _{s}}}({\textbf{z}}_s \vert {\textbf{x}}_s)}[y_{s}^T \ln {\varphi }_\omega ({\textbf{z}}_s)]. \end{aligned}$$

(15)
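As a rough illustration of Eqs. (14) and (15), the expectation over the variational posterior can be approximated by Monte Carlo sampling. The sketch below assumes a diagonal Gaussian posterior and a linear softmax predictor; the weights `W`, `b` and the function names are hypothetical placeholders, not part of the original method.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sample_posterior(mu, log_var, n_samples, rng):
    """Draw z_s ~ N(mu, diag(exp(log_var))) with the reparameterization trick."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def predictive_and_loss(mu, log_var, y_onehot, W, b, n_samples, rng):
    z = sample_posterior(mu, log_var, n_samples, rng)        # (S, N, d)
    probs = softmax(z @ W + b)                               # (S, N, C)
    p_y_given_x = probs.mean(axis=0)                         # Monte Carlo form of Eq. (14)
    ce = -(y_onehot * np.log(probs + 1e-12)).sum(axis=-1)    # (S, N)
    return p_y_given_x, ce.mean()                            # Monte Carlo form of Eq. (15)

rng = np.random.default_rng(0)
N, d, C = 4, 3, 5
mu, log_var = rng.standard_normal((N, d)), np.zeros((N, d))
W, b = rng.standard_normal((d, C)), np.zeros(C)
y_onehot = np.eye(C)[rng.integers(0, C, size=N)]
p_y, loss = predictive_and_loss(mu, log_var, y_onehot, W, b, n_samples=32, rng=rng)
print(p_y.shape, float(loss))
```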

As for the third term in Eq. (9), since we have no labels for the target domain, we follow Shu et al. (2018) and Kumar et al. (2018) to learn a discriminative target representation: we impose the low-density and smoothness assumptions by applying conditional entropy (CE) minimization and virtual adversarial training (VAT):

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) = -{\mathbb {E}}_{{{\textbf{z}}_t} \sim q_{{\phi }_t}({\textbf{z}}_t\vert {\textbf{x}}_t)}[{\varphi _{{\omega }}({\textbf{z}}_t)}^T \ln \varphi _{{\omega }}({\textbf{z}}_t)] \end{aligned} \end{aligned}$$

(16)

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })= {\mathbb {E}}_{{{\textbf{z}}_t} \sim q_{{\phi }_t}({\textbf{z}}_t\vert {{\textbf{x}}}_t)}\big [\max _{\left\Vert r\right\Vert \le \epsilon } D_{KL}(\varphi _{{\omega }}({\textbf{z}}_t) \vert \vert \varphi _{{\omega }}({\textbf{z}}_t+r))\big ]. \end{aligned} \end{aligned}$$

(17)

While the conditional entropy minimization (Eq. (16)) forces the predictor to be confident in the unlabeled target data by pushing the decision boundaries away from the target data, VAT loss (Eq. (17)) enforces prediction consistency within the neighborhood of training samples. Note that VAT can be applied on both or either of the source and target distributions.
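The two target losses can be sketched as follows for a linear softmax predictor, which is an assumption chosen so that the gradient of the KL term has a closed form. The inner maximization in Eq. (17) is approximated with a single power-iteration-style step, a common heuristic for VAT; `vat_loss_linear`, `xi`, and the other names are illustrative only.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def conditional_entropy(probs):
    """Eq. (16): mean entropy of the predictions on target samples."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean()

def kl_div(p, q):
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

def vat_loss_linear(z, W, b, eps=1.0, xi=1e-2, rng=None):
    """One-step approximation of Eq. (17) for the predictor softmax(zW + b)."""
    rng = np.random.default_rng() if rng is None else rng
    p = softmax(z @ W + b)
    d = rng.standard_normal(z.shape)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    q = softmax((z + xi * d) @ W + b)
    grad = (q - p) @ W.T  # gradient of KL(p || q) w.r.t. the perturbation
    r_adv = eps * grad / (np.linalg.norm(grad, axis=-1, keepdims=True) + 1e-12)
    q_adv = softmax((z + r_adv) @ W + b)
    return kl_div(p, q_adv).mean(), r_adv

rng = np.random.default_rng(0)
z_t = rng.standard_normal((8, 3))
W, b = rng.standard_normal((3, 4)), np.zeros(4)
l_ce = conditional_entropy(softmax(z_t @ W + b))
l_vat, r_adv = vat_loss_linear(z_t, W, b, eps=0.5, rng=rng)
print(float(l_ce), float(l_vat))
```

With a nonlinear predictor the gradient would come from automatic differentiation instead of the closed form used here.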

The overall objective of our proposed MFF to be minimized is given by:

$$\begin{aligned} \begin{aligned} \min _{{\theta }_s, {\theta }_t, {\phi }_{s}, {\phi }_{t}, {\eta }, {\omega }} \quad& {{\mathcal L}_{{\gamma }}}({\theta }_{s}, {\theta }_{t}, {\phi }_{s}, {\phi }_{t}, {\eta }) + \lambda _{s}{\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s)\\ {}&+ \lambda _{t}({\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) + {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })), \end{aligned} \end{aligned}$$

(18)

This objective is too complex to train directly. Hence, we further assume a shared decoder (${\theta }_{s} = {\theta }_{t} = {\theta }$) and a shared encoder (${\phi }_{s} = {\phi }_{t} = {\phi }$) for the source and target domains. Moreover, we let the translation loss, i.e., ${\mathcal {L}}_{tr}$ in Eq. (12), be learned adversarially by employing a discriminator $d$ with extra parameters $\theta _{d}$. The overall learning objective is therefore redefined as follows:

$$\begin{aligned} \begin{aligned} \min _{{\theta }, {\phi }, {\eta }, {\omega }} \max _{\theta _{d}} \quad& {{\mathcal L}_{{\gamma }}}({\theta }, {\phi }, {\eta }) + \lambda _{s}{\mathcal {L}}_{{\beta }}({\omega }; {\textbf{z}}_s) + \lambda _{t}({\mathcal {L}}_{{ce}}({\textbf{z}}_t;{\omega }) + {\mathcal {L}}_{{vat}}({\textbf{z}}_t;{\omega })), \end{aligned} \end{aligned}$$

(19)

where $\mu =({\theta }, {\phi }, {\eta , \omega })$ are all parameters to be learned, and $\lambda = (\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{f}, \lambda _{s}, \lambda _{t})$ are regularization parameters. To simplify the training objective, we also tried pre-training the flow model, under an assumed Gaussian prior, using a target auto-encoder (Xiao et al., 2019).

6 Experiments

In this section, we first present the experimental setup and the implementation details of our model. We then report the results, comparing our model with SOTA methods in UDA, and provide a qualitative analysis of the method.

6.1 Setup

6.1.1 Data sets

To demonstrate the performance of our proposed method, we evaluate our model on three commonly used digit datasets for UDA: MNIST (LeCun, 1998), SVHN (Netzer et al., 2011), and USPS (Le Cun et al., 1990). For general object classification tasks, we rely on CIFAR-10 (Krizhevsky & Hinton, 2009), STL-10 (Coates et al., 2011), and Office-31 (Saenko et al., 2010). Additionally, we evaluate our model on adaptation tasks for a large-scale dataset. In particular, we test on VisDA-2017 (Peng et al., 2017) for the image classification task.

6.1.2 Baselines

We primarily compare our proposed MapFlow with three baselines: ALDA (Chen et al., 2020), MDD+Implicit (Jiang et al., 2020), and VADA (Shu et al., 2018). We also show the results of several other recently proposed UDA models for comparison, including Maximum Classifier Discrepancy (MCD) (Saito et al., 2018), Joint Adaptation Network (JAN) (Long et al., 2017), Self-Ensembling (S-En) (French, 2017), and Conditional Domain Adversarial Networks (CDAN) (Long et al., 2018). For a fair comparison, the results are reported from the original papers if available. For all experiments, we report accuracy for each domain shift, averaged over 3 repeated runs.

6.2 Implementation

6.2.1 Architecture

To make fair comparisons on the digits and CIFAR-10/STL-10 datasets, we adopt the architectural components, including the classifier network, the feature extractor, and the discriminator, used in DIRT-T (Shu et al., 2018). Similarly, we use a small architecture for the digits UDA tasks and a larger architecture for UDA experiments between CIFAR-10 and STL-10. For the Office-31 and VisDA-2017 datasets, we employ ResNet-50 (He et al., 2016), pre-trained on ImageNet (Russakovsky et al., 2015), as the feature extractor. The discriminator network is composed of two fully connected layers with dropout (Ganin et al., 2016). Note that our architecture is slightly different, as we add an invertible feature transform to the classifier network; however, the invertible network adds only a small parameter overhead to the shared feature extractor and classifier (less than 4%). For the invertible network applied to the latent variables, we use the Glow architecture (Kingma & Dhariwal, 2018) with 4 affine coupling blocks, where each block contains 3 fully connected layers, each with 256 or 512 hidden units depending on the dataset.

6.2.2 Training settings and hyper-parameters

For digits and CIFAR10/STL datasets, we implement adversarial training via alternating updates (Shu et al., 2018), and train the model using Adam optimizer (Kingma & Ba, 2014) with learning rate ${10}^{-3}$ decaying by a factor of 2 after 200 epochs.

For the Office-31 and VisDA-2017 datasets, we follow the protocols of Chen et al. (2020), including the optimizer and learning rate strategy. We optimize the model using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and an adjusted learning rate $\eta _p = \eta _0 (1+\alpha q)^{-\gamma }$, where $\eta _0= 0.01$, $\alpha = 10$, $\gamma = 0.75$, and q is the training progress linearly decreasing from 1 to 0. Note that we set the learning rates of the classifier and discriminator to be 10 times that of the generator.

As for the hyper-parameters $(\lambda _{sr}, \lambda _{tr}, \lambda _{kl}, \lambda _{s}, \lambda _{t})$, we tune the values for each dataset using cross-validation. We observed that extensive hyper-parameter tuning is not required to obtain top-performance results. Accordingly, we limit the hyper-parameter search for each task to $\lambda _{sr} = \lambda _{tr}= \{10 ^{-1}, 10^{-2}\}, \lambda _{kl} = \{1, 10^{-1}\}, \lambda _{s} = \{1\}, \lambda _{t} = \{0, 1, 10^{-1}, 10^{-2}\}$.
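To make the size of this search space concrete, the grid can be enumerated as below. The sketch reads $\lambda _{sr} = \lambda _{tr}$ as a single tied value, which is an interpretation of the notation rather than something the text states explicitly; the variable names are illustrative.

```python
from itertools import product

lambda_rec = [1e-1, 1e-2]          # tied value for lambda_sr and lambda_tr (assumption)
lambda_kl = [1.0, 1e-1]
lambda_s = [1.0]
lambda_t = [0.0, 1.0, 1e-1, 1e-2]

grid = [
    {"lambda_sr": r, "lambda_tr": r, "lambda_kl": k, "lambda_s": s, "lambda_t": t}
    for r, k, s, t in product(lambda_rec, lambda_kl, lambda_s, lambda_t)
]
print(len(grid))  # 16
```

Under this reading, each task requires at most 16 configurations, which is consistent with the remark that extensive tuning is not needed.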

6.3 Results

Table 1 Test accuracy ($\%$) on standard domain adaptation benchmarks


Table 1 summarizes the results of the average accuracy ($\%$) on the standard classification benchmarks for UDA, such as digits, CIFAR-10, and STL data sets, compared with SOTA methods. For fair comparison, we resize all images to $32 \times 32 \times 3$ (except in case of adaptation from USPS to MNIST) and apply instance normalization (Shu et al., 2018) to input images. Below, we present a brief analysis of the results in Table 1.

USPS$\varvec{\rightarrow }$MNIST: although USPS contains a smaller training set than MNIST, domain discrepancy between these two datasets is relatively small, and we could achieve high performance in USPS $\rightarrow$ MNIST.

MNIST$\varvec{\leftrightarrow }$SVHN: for the adaptation task SVHN $\rightarrow$ MNIST, we resize MNIST images to the $32 \times 32$ resolution of SVHN, with three channels. This adaptation problem is easily solved when the proposed MapFlow is applied: our method achieves performance similar to the SOTA DTA (Lee et al., 2019) on MNIST. The reverse task, MNIST $\rightarrow$ SVHN, can be regarded as the most challenging case among the digit datasets, as MNIST has a considerably lower dimensionality than SVHN. Experiments show that MapFlow achieves state-of-the-art results on this adaptation task. On average, MapFlow achieved a $\mathbf {4.8\%}$ improvement over DIRT-T (Shu et al., 2018). This improvement shows the importance of relaxed invariant representation.

CIFAR-10$\varvec{\leftrightarrow }$STL-10: in both adaptation directions, the results in Table 1 show that MapFlow is slightly better than the SOTA, which we believe is due to the relatively smaller training set for STL-10 and the existing imbalance between the two datasets.

Table 2 Test Accuracy (%) on Office-31 adaptation tasks for unsupervised domain adaptation (ResNet-50)


The results in Table 2 show again the superiority of our approach compared to other recently proposed methods on Office-31 datasets. We evaluate MapFlow across six UDA tasks: $\text {A} \rightarrow \text {W}$, $\text {W} \rightarrow \text {D}$, $\text {D} \rightarrow \text {W}$, $\text {A} \rightarrow \text {D}$, $\text {D} \rightarrow \text {A}$, and $\text {W} \rightarrow \text {A}$. Our method surpasses the baselines in 3 out of 6 pairs of adaptation tasks for Office-31. We further demonstrate the generalization ability of the proposed method by conducting additional experiments on VisDA-2017. In our experiments, we observed a gain of 0.6 points over the baseline (Chen et al., 2020), confirming the flexibility of MapFlow and its applicability across UDA tasks. The SOTA results with ResNet-50 are reported in Table 3.

Table 3 Test accuracy (%) on VisDA-2017 for unsupervised domain adaptation (ResNet-50)


6.4 Ablation studies

To examine the relative contribution of the invertible network in MapFlow, we conduct ablations on the adaptation tasks presented in Table 1, with and without the loss term of Eq. (13). The results are reported in Table 4, where the “no-nf” subscript denotes the removal of the NF component. We observe that when the loss, including the term for the log determinant of the Jacobian, is applied (MapFlow), our method demonstrates a significant improvement over MapFlow$_{\mathrm {no-nf}}$ and previous works. These results demonstrate the effectiveness of the flow model in relaxing invariance enforcement.

Table 4 Test accuracy ($\%$) on standard domain adaptation benchmarks in ablation experiment


6.5 Analysis

6.5.1 Qualitative analysis

To further analyze the relaxed invariant representation, we visualize the non-adapted and adapted feature representations generated from the last hidden layer of the model on the SVHN $\rightarrow$ MNIST UDA task using t-SNE (Van der Maaten & Hinton, 2008). As illustrated in Fig. 2, the source-only (non-adapted) model shows strong clustering of the SVHN samples but performs poorly on MNIST (Fig. 2a). MapFlow delivers higher feature discriminability in the target domain by keeping each class well separated without forcing the target clusters to be completely aligned with the source domain (Fig. 2c).

6.5.2 Target error bound

The learning theory of UDA was initially proposed by Ben-David et al. (2010) and is summarized in Theorem 1.

Theorem 1

(Ben-David et al., 2010) Let ${\mathcal {H}} = \{\varphi \circ g: \varphi \in \Phi ,g \in {\mathcal {G}} \}$ be the hypothesis space, where ${\mathcal {G}}$ and $\Phi$ are considered to be the set of representations and predictive functions respectively, and let $\varepsilon (h)$ be the risk for $h \in {\mathcal {H}}$, and $\varepsilon (h, h')$ be the risk for $(h, h') \in {\mathcal {H}}^2$. Then, for any $h = \varphi \circ g \in {\mathcal {H}}$,

$$\begin{aligned} \varepsilon _t(\varphi \circ g) \le \underbrace{\varepsilon _s(\varphi \circ g)}_{\textcircled {1}} + \underbrace{\frac{1}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}(p_{s}, p_{t})}_ {\textcircled {2}}+ \underbrace{\Psi (h)}_{\textcircled {3}}, \end{aligned}$$

(20)

where $d_{{\mathcal {H}} \Delta {\mathcal {H}}}$ in the second term denotes ${\mathcal {H}} \Delta {\mathcal {H}}$ distance between source and target domains, $\Psi (h)$ is the shared error of the ideal joint hypothesis, and

$$\begin{aligned} d_{{\mathcal {H}} \Delta {\mathcal {H}}}(p_{s}, p_{t}) = 2\sup \limits _{h, h'\in {\mathcal {H}}} \Big |\varepsilon _s(h, h')-\varepsilon _t(h, h')\Big |\end{aligned}$$

(21)

$$\begin{aligned} \Psi (h) = \inf \limits _{h\in {\mathcal {H}}} \big [\varepsilon _s(h)+\varepsilon _t(h)\big ]. \end{aligned}$$

(22)

We analyze the second term, domain discrepancy, and the third term, ideal joint hypothesis error, of the target error bound, as formulated in Eq. (20), on SVHN $\rightarrow$ MNIST task.

Domain Discrepancy Domain discrepancy can be estimated approximately by the ${\mathcal {A}}$-distance (Ben-David et al., 2010), defined as ${\mathcal {A}} = 2(1-2\epsilon )$, where $\epsilon$ denotes the error of a domain classifier trained to discriminate the source and target representations. As illustrated in Fig. 3a, MapFlow reduces domain discrepancy more than standard domain adversarial training but less than VADA (Shu et al., 2018), implying a relaxed invariance.
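The ${\mathcal {A}}$-distance formula is simple enough to state as a small helper. The function name and the example error values below are illustrative and are not measurements from the paper.

```python
def a_distance(domain_classifier_error):
    """Proxy A-distance, A = 2 * (1 - 2 * eps), from a domain classifier's error eps."""
    if not 0.0 <= domain_classifier_error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)

# A classifier at chance (eps = 0.5) cannot tell the domains apart,
# while a perfect one (eps = 0) gives the maximal distance.
print(a_distance(0.5), a_distance(0.25), a_distance(0.0))  # 0.0 1.0 2.0
```

A lower value therefore indicates representations that are harder to separate by domain, which is the sense in which a relaxed method is expected to sit between no adaptation and strict invariance.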

Ideal Joint Hypothesis We evaluate the Ideal Joint Hypothesis by training an MLP classifier with two layers on the adapted features from both target and source domains, as suggested in (Chen et al., 2019). As shown in Fig. 3b, MapFlow reduces the joint error, which indicates that our method improves the feature discriminability.

7 Conclusion and future work

In this paper, we present a novel relaxed invariant representation learning approach for unsupervised domain adaptation. In standard domain invariance learning, the transferability of feature representations is enhanced at the expense of their discriminability. Thus, we propose a general framework to relax invariance enforcement in representation space. Our method aims at encouraging a combination of domain invariance and specificity to enhance the target discriminability. The framework relies on normalizing flow to learn a transformation between the distributions of the target and source domains in representation space. Specifically, normalizing flow maps a complex target latent distribution into a well-clustered latent source distribution through a sequence of invertible functions. We mathematically derived a variational lower bound for the probability distribution changing across domains and showed the consistency of the lower bound with the relaxed invariance assumption. Through extensive experiments, our approach demonstrates its superiority to other methods based on invariant representations on several public UDA datasets, validating our analysis. In future work, we intend to extend our model to work in the presence of label and conditional shifts for domain adaptation.

Availability of data and materials

The data and material used in this paper are publicly available.

Code availability

The code will be made available.

References

Acuna, D., Law, M. T., Zhang, G., & Fidler, S. (2022). Domain adversarial training: A game perspective. arXiv preprint arXiv:2202.05352
Ardizzone, L., Kruse, J., Wirkert, S., Rahner, D., Pellegrini, E. W., Klessen, R. S., Maier-Hein, L., Rother, C., & Köthe, U. (2018). Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893
Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In Advances in neural information processing systems (pp. 137–144)
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1–2), 151–175.
Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In Advances in neural information processing systems (pp. 343–351)
Bouvier, V., Hudelot, C., Chastagnol, C., Very, P., & Tami, M. (2019). Domain-invariant representations: A look on compression and weights
Chen, X., Wang, S., Long, M., & Wang, J. (2019). Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International conference on machine learning (pp. 1081–1090). PMLR
Chen, M., Zhao, S., Liu, H., & Cai, D. (2020). Adversarial-learned loss for domain adaptation. In Proceedings of the AAAI conference on artificial intelligence (vol. 34, pp. 3521–3528)
Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34, 9355–9366.
Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 215–223)
Courty, N., Flamary, R., Habrard, A., & Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30
Cui, S., Wang, S., Zhuo, J., Li, L., Huang, Q., & Tian, Q. (2020). Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3941–3950)
Cui, S., Wang, S., Zhuo, J., Su, C., Huang, Q., & Tian, Q. (2020). Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12455–12464)
Cui, S., Jin, X., Wang, S., He, Y., & Huang, Q. (2020). Heuristic domain adaptation. Advances in Neural Information Processing Systems, 33, 7571–7583.
Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In European conference on computer vision (pp. 467–483). Springer
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real nvp. arXiv preprint arXiv:1605.08803
French. (2017). Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208
Ganin, Y., & Lempitsky, V. (2014). Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2096–2030.
Gao, J., Zhang, J., Liu, X., Darrell, T., Shelhamer, E., & Wang, D. (2022). Back to the source: Diffusion-driven test-time adaptation. arXiv preprint arXiv:2207.03442
Gong, R., Li, W., Chen, Y., & Gool, L. V. (2019). Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2477–2486)
Grover, A., Chute, C., Shu, R., Cao, Z., & Ermon, S. (2019). Alignflow: Cycle consistent learning from multiple domains via normalizing flows. arXiv preprint arXiv:1905.12892
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
Hoffman, J., Tzeng, E., Park, T., Zhu, J. -Y., Isola, P., Saenko, K., Efros, A. A., & Darrell, T. (2017). Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213
Hoffman, J., Tzeng, E., Park, T., Zhu, J. -Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1989–1998). PMLR
Huang, J., Guan, D., Xiao, A., & Lu, S. (2021). Model adaptation: Historical contrastive learning for unsupervised domain adaptation without source data. Advances in Neural Information Processing Systems, 34, 3635–3649.
Izmailov, P., Kirichenko, P., Finzi, M., & Wilson, A. G. (2020). Semi-supervised learning with normalizing flows. In International conference on machine learning (pp. 4615–4630). PMLR
Jiang, X., Lao, Q., Matwin, S., & Havaei, M. (2020). Implicit class-conditioned domain alignment for unsupervised domain adaptation. In International conference on machine learning (pp. 4816–4827). PMLR
Jin, X., Lan, C., Zeng, W., & Chen, Z. (2020). Feature alignment and restoration for domain generalization and adaptation. arXiv preprint arXiv:2006.12009
Jin, Y., Wang, X., Long, M., & Wang, J. (2020). Minimum class confusion for versatile domain adaptation. In European conference on computer vision (pp. 464–480). Springer
Johansson, F. D., Ranganath, R., & Sontag, D. (2019). Support and invertibility in domain-invariant representations. arXiv preprint arXiv:1903.03448
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems (pp. 10215–10224)
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
Kumar, A., Ma, T., & Liang, P. (2020). Understanding self-training for gradual domain adaptation. In International conference on machine learning (pp. 5468–5479). PMLR
Kumar, A., Sattigeri, P., Wadhawan, K., Karlinsky, L., Feris, R., Freeman, B., & Wornell, G. (2018). Co-regularized alignment for unsupervised domain adaptation. In Advances in neural information processing systems (pp. 9345–9356)
Kundu, J. N., Kulkarni, A. R., Bhambri, S., Mehta, D., Kulkarni, S. A., Jampani, V., & Radhakrishnan, V. B. (2022). Balancing discriminability and transferability for source-free domain adaptation. In International conference on machine learning (pp. 11710–11728). PMLR
Le Cun, Y., Matan, O., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L., & Baird, H. S. (1990). Handwritten zip code recognition with multilayer networks. In Proceedings of 10th international conference on pattern recognition (vol. 2, pp. 35–40)
LeCun, Y. (1998). The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Lee, C. -Y., Batra, T., Baig, M. H., & Ulbricht, D. (2019). Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10285–10295)
Lee, S., Kim, D., Kim, N., & Jeong, S. -G. (2019). Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 91–100)
Li, Y., Wang, N., Shi, J., Liu, J., & Hou, X. (2016). Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779
Liang, J., Hu, D., & Feng, J. (2020). Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International conference on machine learning (pp. 6028–6039). PMLR
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., Van Der Laak, J. A., Van Ginneken, B., & Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42, 60–88.
Liu, M. -Y., & Tuzel, O. (2016). Coupled generative adversarial networks. In Advances in neural information processing systems (pp. 469–477)
Liu, M. -Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848
Liu, H., Long, M., Wang, J., & Jordan, M. I. (2019). Transfer adversarial training: A general approach to adapting deep classifiers. Transfer 1/20
Liu, H., Wang, J., & Long, M. (2021). Cycle self-training for domain adaptation. Advances in Neural Information Processing Systems, 34, 22968–22981.
Long, M., Cao, Y., Wang, J., & Jordan, M. I. (2015). Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791
Long, M., Cao, Z., Wang, J., & Jordan, M. I. (2018). Conditional adversarial domain adaptation. In Advances in neural information processing systems (pp. 1640–1650)
Long, M., Wang, J., Ding, G., Sun, J., & Yu, P. S. (2013). Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision (pp. 2200–2207)
Long, M., Zhu, H., Wang, J., & Jordan, M. I. (2017). Deep transfer learning with joint adaptation networks. In Proceedings of the 34th international conference on machine learning (vol. 70, pp. 2208–2217). JMLR. org
Maria Carlucci, F., Porzi, L., Caputo, B., Ricci, E., & Rota Bulo, S. (2017). Autodial: Automatic domain alignment layers. In Proceedings of the IEEE international conference on computer vision (pp. 5067–5075)
Murez, Z., Kolouri, S., Kriegman, D., Ramamoorthi, R., & Kim, K. (2018). Image to image translation for domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4500–4509)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning
Nguyen, A. T., Tran, T., Gal, Y., Torr, P. H., & Baydin, A. G. (2021). Kl guided domain adaptation. arXiv preprint arXiv:2106.07780
Papamakarios, G., Nalisnick, E. T., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57), 1–64.
Pei, Z., Cao, Z., Long, M., & Wang, J. (2018). Multi-adversarial domain adaptation. In Proceedings of the AAAI conference on artificial intelligence (vol. 32)
Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., & Saenko, K. (2017). Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924
Rangwani, H., Aithal, S. K., Mishra, M., Jain, A., & Radhakrishnan, V. B. (2022). A closer look at smoothness in domain adversarial training. In International conference on machine learning (pp. 18378–18399). PMLR
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In European conference on computer vision (pp. 213–226). Springer
Sagawa, S., & Hino, H. (2022). Gradual domain adaptation via normalizing flows. arXiv preprint arXiv:2206.11492
Saito, K., Ushiku, Y., & Harada, T. (2017). Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th international conference on machine learning (vol. 70, pp. 2988–2997). JMLR. org
Saito, K., Ushiku, Y., Harada, T., & Saenko, K. (2017). Adversarial dropout regularization. arXiv preprint arXiv:1711.01575
Saito, K., Watanabe, K., Ushiku, Y., & Harada, T. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3723–3732)
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.
Shu, R., Bui, H. H., Narui, H., & Ermon, S. (2018). A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735
Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision (pp. 443–450). Springer
Sun, B., Feng, J., & Saenko, K. (2017). Correlation alignment for unsupervised domain adaptation. In Domain adaptation in computer vision applications (pp. 153–171). Springer
Sun, Y., Tzeng, E., Darrell, T., & Efros, A. A. (2019). Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825
Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR
Tong, S., Garipov, T., Zhang, Y., Chang, S., & Jaakkola, T. S. (2022). Adversarial support alignment. arXiv preprint arXiv:2203.08908
Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7167–7176)
Usman, B., Sud, A., Dufour, N., & Saenko, K. (2020). Log-likelihood ratio minimizing flows: Towards robust and quantifiable neural distribution alignment. Advances in Neural Information Processing Systems, 33, 21118–21129.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(11)
Voulodimos, A., Doulamis, N., Doulamis, A., & Protopapadakis, E. (2018). Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018
Wang, X., Li, L., Ye, W., Long, M., & Wang, J. (2019). Transferable attention for domain adaptation. In Proceedings of the AAAI conference on artificial intelligence (vol. 33, pp. 5345–5352)
Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2020). Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726
Wang, Z., Cheng, X., Sapiro, G., & Qiu, Q. (2020). A dictionary approach to domain-invariant learning in deep networks. Advances in Neural Information Processing Systems, 33, 6595–6605.
Wei, G., Lan, C., Zeng, W., & Chen, Z. (2021). Metaalign: Coordinating domain alignment and classification for unsupervised domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16643–16653)
Wei, G., Lan, C., Zeng, W., Zhang, Z., & Chen, Z. (2021). Toalign: Task-oriented alignment for unsupervised domain adaptation. Advances in Neural Information Processing Systems, 34, 13834–13846.
Wu, Y., Winston, E., Kaushik, D., & Lipton, Z. (2019). Domain adaptation with asymmetrically-relaxed distribution alignment. arXiv preprint arXiv:1903.01689
Xiao, Z., Yan, Q., & Amit, Y. (2019). Generative latent flow. arXiv preprint arXiv:1905.10485
Xu, T., Chen, W., Wang, P., Wang, F., Li, H., & Jin, R. (2021). Cdtrans: Cross-domain transformer for unsupervised domain adaptation. arXiv preprint arXiv:2109.06165
Xu, R., Li, G., Yang, J., & Lin, L. (2019). Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1426–1435)
Zhang, Y., Liu, T., Long, M., & Jordan, M. (2019). Bridging theory and algorithm for domain adaptation. In International conference on machine learning (pp. 7404–7413). PMLR
Zhang, Y., Tang, H., Jia, K., & Tan, M. (2019). Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5031–5040)
Zhao, H., Des Combes, R. T., Zhang, K., & Gordon, G. (2019). On learning invariant representations for domain adaptation. In International conference on machine learning (pp. 7523–7532). PMLR
Zhu, L., Wang, W., Zhang, M. H., Ooi, B. C., & Yao, C. (2019). Distribution matching prototypical network for unsupervised domain adaptation

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions. None.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Queensland, St Lucia, Brisbane, QLD, 4072, Australia
Hossein Askari & Hongfu Sun
Australian Institute for Machine Learning, Adelaide University, North Terrace, Adelaide, SA, 5005, Australia
Yasir Latif

Authors

Hossein Askari
Yasir Latif
Hongfu Sun

Contributions

YL contributed to this work through technical discussion, paper editing, and revision. HS contributed to this work through technical discussion, paper editing, revision, and supervision. I am submitting the enclosed manuscript for potential publication only in the Machine Learning journal. I attest that this paper has not been published elsewhere and has been prepared following the instructions to authors. The second author has contributed to this manuscript and has reviewed and approved the current form of the manuscript to be submitted.

Corresponding author

Correspondence to Hossein Askari.

Ethics declarations

Conflict of interest

Not applicable

Ethical approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Editor: Derek Greene.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A derivation of the ELBO

$$\begin{aligned} \begin{aligned} \log p({\textbf{x}}_t,{\textbf{x}}_s)&= \log \int p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)p({\textbf{z}}_t,{\textbf{z}}_s)d{\textbf{z}}_sd{\textbf{z}}_t\\&= \log \int p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)p({\textbf{z}}_t,{\textbf{z}}_s)\frac{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}d{\textbf{z}}_sd{\textbf{z}}_t\\&= \log {\mathbb {E}}_{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\big [p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)\frac{p({\textbf{z}}_t,{\textbf{z}}_s)}{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\big ]\\&\ge {\mathbb {E}}_{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\big [\log \Big (p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)\frac{p({\textbf{z}}_t,{\textbf{z}}_s)}{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\Big )\big ]\\&= {\mathbb {E}}_{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s))+\log (p({\textbf{z}}_t,{\textbf{z}}_s))\big ]\\&\qquad \qquad -{\mathbb {E}}_{q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)}\big [\log (q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s))\big ] \end{aligned} \end{aligned}$$

(A1)
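As a sanity check on the Jensen step in Eq. (A1), the bound can be evaluated in closed form on a toy model. The sketch below is an illustration only and is not part of MapFlow: it assumes a single scalar latent with prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), and a Gaussian variational posterior N(m, s2). The function names and the test values are hypothetical.

```python
import math


def log_evidence(x):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(z, 1), which gives x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """Closed-form ELBO for q(z | x) = N(m, s2) in the same toy model."""
    log_2pi = math.log(2 * math.pi)
    recon = -0.5 * log_2pi - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x | z)]
    prior = -0.5 * log_2pi - 0.5 * (m * m + s2)         # E_q[log p(z)]
    entropy = 0.5 * (log_2pi + 1.0 + math.log(s2))      # -E_q[log q(z | x)]
    return recon + prior + entropy


x = 1.0
print("log p(x)            :", log_evidence(x))
print("ELBO, exact q       :", elbo(x, 0.5, 0.5))  # true posterior is N(x/2, 1/2)
print("ELBO, mismatched q  :", elbo(x, 0.0, 1.0))
```

In this toy setting the bound is tight when q equals the true posterior N(x/2, 1/2) and strictly lower for any other Gaussian q, which is the behaviour the inequality in Eq. (A1) predicts.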

The last expression in Eq. (A1) is the ELBO of the joint distribution. We further assume conditional independence between the source and target distributions:

$$\begin{aligned} \begin{aligned} q({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t,{\textbf{x}}_s)&= q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s),\\ p({\textbf{x}}_t,{\textbf{x}}_s\vert {\textbf{z}}_t,{\textbf{z}}_s)&= p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s)p({\textbf{x}}_s\vert {\textbf{z}}_s). \end{aligned} \end{aligned}$$

(A2)

Therefore, we can derive the following result.

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s)p({\textbf{x}}_s\vert {\textbf{z}}_s))+\log (p({\textbf{z}}_t,{\textbf{z}}_s))-\log (q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s))\big ]\\&\quad = {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s))+\log (p({\textbf{x}}_s\vert {\textbf{z}}_s))+\log (p({\textbf{z}}_t,{\textbf{z}}_s))\big ]\\&\qquad -{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (q({\textbf{z}}_t\vert {\textbf{x}}_t))+\log (q({\textbf{z}}_s\vert {\textbf{x}}_s))\big ]\\&\quad = {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s))\big ]+{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_s\vert {\textbf{z}}_s))\big ]\\&\qquad +{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{z}}_t,{\textbf{z}}_s))-\log (q({\textbf{z}}_t\vert {\textbf{x}}_t))-\log (q({\textbf{z}}_s\vert {\textbf{x}}_s))\big ]\\&\quad = {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s))]+{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{x}}_s\vert {\textbf{z}}_s))]\\&\qquad +{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_t,{\textbf{z}}_s))]-{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (q({\textbf{z}}_s\vert {\textbf{x}}_s))]\\&\qquad -{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)}[\log (q({\textbf{z}}_t\vert {\textbf{x}}_t))]. \end{aligned} \end{aligned}$$

(A3)

We rewrite the last result of Eq. (A3), labelling its five terms:

$$\begin{aligned} \begin{aligned} \text {ELBO}&= \underbrace{{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s))\big ]}_{(1)}+\underbrace{{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_s\vert {\textbf{z}}_s))\big ]}_{(2)}\\&\quad +\underbrace{{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{z}}_t,{\textbf{z}}_s))\big ]}_{(3)}-\underbrace{{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)}\big [\log (q({\textbf{z}}_t\vert {\textbf{x}}_t))\big ]}_{(4)}\\&\quad -\underbrace{{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (q({\textbf{z}}_s\vert {\textbf{x}}_s))\big ]}_{(5)}, \end{aligned} \end{aligned}$$

(A4)

Noting that

$$\begin{aligned} \begin{aligned} p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s)&= \frac{p({\textbf{z}}_t,{\textbf{z}}_s\vert {\textbf{x}}_t)p({\textbf{x}}_t)}{p({\textbf{z}}_t,{\textbf{z}}_s)}\qquad \qquad \text {(Bayes' theorem)}\\&= \frac{p({\textbf{z}}_t\vert {\textbf{z}}_s,{\textbf{x}}_t)p({\textbf{z}}_s\vert {\textbf{x}}_t)p({\textbf{x}}_t)}{p({\textbf{z}}_t,{\textbf{z}}_s)}\qquad \text {(Chain rule)} \\&= \frac{p({\textbf{z}}_t\vert {\textbf{z}}_s)p({\textbf{z}}_s\vert {\textbf{x}}_t)p({\textbf{x}}_t)}{p({\textbf{z}}_t,{\textbf{z}}_s)}\\&= \frac{p({\textbf{z}}_t\vert {\textbf{z}}_s)p({\textbf{z}}_s\vert {\textbf{x}}_t)p({\textbf{x}}_t)}{p({\textbf{z}}_t\vert {\textbf{z}}_s)p({\textbf{z}}_s)}\\&= \frac{p({\textbf{z}}_s\vert {\textbf{x}}_t)p({\textbf{x}}_t)}{p({\textbf{z}}_s)}=\frac{p({\textbf{z}}_s,{\textbf{x}}_t)}{p({\textbf{z}}_s)}=\frac{p({\textbf{x}}_t\vert {\textbf{z}}_s)p({\textbf{z}}_s)}{p({\textbf{z}}_s)}=p({\textbf{x}}_t\vert {\textbf{z}}_s), \end{aligned} \end{aligned}$$

(A5)

then, the term (1) in Eq. (A4) can be redefined as

$$\begin{aligned} \begin{aligned} (1)&= {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_t,{\textbf{z}}_s))\big ]={\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_s))\big ]\\&= {\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [\log (p({\textbf{x}}_t\vert {\textbf{z}}_s))\big ]={\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\Big [\log \Big (\int p({\textbf{x}}_t\vert {\textbf{z}}_t)p({\textbf{z}}_t\vert {\textbf{z}}_s)d{\textbf{z}}_t\Big )\Big ]\\&\ge {\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}\big [{\mathbb {E}}_{p({\textbf{z}}_t\vert {\textbf{z}}_s)}\left[\log \left(p({\textbf{x}}_t\vert {\textbf{z}}_t)\right)\right]\big ]. \end{aligned} \end{aligned}$$

(A6)

By assuming that

$$\begin{aligned} {\textbf{z}}_t= f({\textbf{z}}_s) \quad \Rightarrow \quad p({\textbf{z}}_t\vert {\textbf{z}}_s) = \delta ({\textbf{z}}_t-f({\textbf{z}}_s)), \end{aligned}$$

(A7)

then, the term (3) in Eq. (A4) can be redefined as

$$\begin{aligned} \begin{aligned} (3)&= {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_t\vert {\textbf{z}}_s)p({\textbf{z}}_s))]\\&= {\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_t\vert {\textbf{z}}_s))]+{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_s))]\\&= \text {constant}+{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_s))]. \end{aligned} \end{aligned}$$

(A8)

By the change-of-variables formula for the invertible map $f$, term (4) in Eq. (A4) becomes

$$\begin{aligned} -{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)}\big [\log (q({\textbf{z}}_t\vert {\textbf{x}}_t))\big ] = -{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)}\left[\log (p(f^{-1}({\textbf{z}}_t)))+\log \Big \vert \det \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}\Big \vert \right ]. \end{aligned}$$

(A9)
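The right-hand side of Eq. (A9) is the change-of-variables formula for an invertible map. A minimal numerical sketch follows, assuming a one-dimensional affine map f(z_s) = A z_s + B with a standard normal base density. The constants `A` and `B` and all function names are hypothetical and chosen only so that the pushed-forward density is known exactly; the flow used in the paper is not affine.

```python
import math

A, B = 2.0, -0.1  # hypothetical flow parameters; A must be non-zero for invertibility


def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2


def f_inverse(z_t):
    """Inverse of the affine map f(z_s) = A * z_s + B."""
    return (z_t - B) / A


def log_density_via_flow(z_t):
    """log p(f^{-1}(z_t)) + log |d f^{-1} / d z_t| with base density p = N(0, 1)."""
    log_abs_det = math.log(abs(1.0 / A))  # Jacobian of f^{-1} is the scalar 1 / A
    return log_normal(f_inverse(z_t), 0.0, 1.0) + log_abs_det


def log_density_direct(z_t):
    """Pushing N(0, 1) through f gives N(B, A^2), used here as the reference."""
    return log_normal(z_t, B, abs(A))


for z_t in (-1.0, 0.0, 0.7, 3.0):
    print(z_t, log_density_via_flow(z_t), log_density_direct(z_t))
```

The two columns agree up to floating-point error, which is the identity Eq. (A9) relies on. In higher dimensions the scalar 1/A is replaced by the Jacobian determinant of the inverse map.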

Putting all of them together we have the following final loss:

$$\begin{aligned} \log (p({\textbf{x}}_t,{\textbf{x}}_s))\ge & {} {\mathcal {L}}(\varvec{\theta }),\\ \text {where}\quad {\mathcal {L}}(\varvec{\theta })= & {} {\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}[{\mathbb {E}}_{p({\textbf{z}}_t\vert {\textbf{z}}_s)}[\log (p({\textbf{x}}_t\vert {\textbf{z}}_t))]]+{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{x}}_s\vert {\textbf{z}}_s))]\\{} & {} \quad +{\mathbb {E}}_{q({\textbf{z}}_s\vert {\textbf{x}}_s)}[\log (p({\textbf{z}}_s))-\log (q({\textbf{z}}_s\vert {\textbf{x}}_s))]\\{} & {} \quad -{\mathbb {E}}_{q({\textbf{z}}_t\vert {\textbf{x}}_t)}\left[\log (p(f^{-1}({\textbf{z}}_t)))+\log \Big \vert \det \frac{\partial f^{-1}}{\partial {\textbf{z}}_t}\Big \vert \right]. \end{aligned}$$
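Under the deterministic map assumed in Eq. (A7), the inner expectation in the first term of the bound reduces to evaluating the decoder at z_t = f(z_s), so that term can be estimated by sampling z_s from the source encoder and pushing each sample through f. The sketch below illustrates this with a Monte Carlo estimate checked against a closed form. Everything in it is an assumption made for illustration: a scalar Gaussian encoder N(m, s^2), an affine map with hypothetical constants `A` and `B`, and a unit-variance Gaussian decoder. It is not the architecture used in the paper.

```python
import math
import random

A, B = 2.0, -0.1  # hypothetical parameters of the map f


def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2


def f(z_s):
    """Deterministic latent transition z_t = f(z_s), here affine for simplicity."""
    return A * z_s + B


def first_term_mc(x_t, m, s, n_samples, seed=0):
    """Monte Carlo estimate of E_{q(z_s|x_s)}[log p(x_t | f(z_s))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z_s = m + s * rng.gauss(0.0, 1.0)   # reparameterised sample from N(m, s^2)
        total += log_normal(x_t, f(z_s), 1.0)  # toy decoder p(x_t | z_t) = N(z_t, 1)
    return total / n_samples


def first_term_exact(x_t, m, s):
    """Closed form of the same expectation for this Gaussian and affine toy case."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * ((x_t - f(m)) ** 2 + (A * s) ** 2)


print("Monte Carlo:", first_term_mc(1.0, 0.3, 0.5, 20000))
print("Closed form:", first_term_exact(1.0, 0.3, 0.5))
```

The remaining terms of the bound would be estimated in the same way: the source reconstruction and prior terms from samples of q(z_s | x_s), and the last term from samples of q(z_t | x_t) passed through the inverse map.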

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Askari, H., Latif, Y. & Sun, H. MapFlow: latent transition via normalizing flow for unsupervised domain adaptation. Mach Learn 112, 2953–2974 (2023). https://doi.org/10.1007/s10994-023-06357-2

Received: 15 June 2022
Revised: 21 December 2022
Accepted: 17 April 2023
Published: 12 July 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10994-023-06357-2
