Zeroth- and first-order difference discrimination for unsupervised domain adaptation

Wang, Jie; Chen, Xing; Zhang, Xiao-Lei

doi:10.1007/s40747-023-01283-1

Zeroth- and first-order difference discrimination for unsupervised domain adaptation

Original Article
Open access
Published: 05 December 2023

Volume 10, pages 2569–2584, (2024)
Cite this article

Download PDF

You have full access to this open access article

Complex & Intelligent Systems Aims and scope Submit manuscript

Zeroth- and first-order difference discrimination for unsupervised domain adaptation

Download PDF

Jie Wang^1,2,
Xing Chen^1,2 &
Xiao-Lei Zhang^1,2

578 Accesses
Explore all metrics

Abstract

Unsupervised domain adaptation transfers empirical knowledge from a label-rich source domain to a fully unlabeled target domain with a different distribution. A core idea of many existing approaches is to reduce the distribution divergence between domains. However, they focused only on part of the discrimination, which can be categorized into optimizing the following four objectives: reducing the intraclass distance between domains, enlarging the interclass distances between domains, reducing the intraclass distances within domains, and enlarging the interclass distances within domains. Moreover, because few methods consider multiple types of objectives, the consistency of data representations produced by different types of objectives has not yet been studied. In this paper, to address the above issues, we propose a zeroth- and first-order difference discrimination (ZFOD) approach for unsupervised domain adaptation. It first optimizes the above four objectives simultaneously. To improve the discrimination consistency of the data across the two domains, we propose a first-order difference constraint to align the interclass differences across domains. Because the proposed method needs pseudolabels for the target domain, we adopt a recent pseudolabel generation method to alleviate the negative impact of imprecise pseudolabels. We conducted an extensive comparison with nine representative conventional methods and seven remarkable deep learning-based methods on four benchmark datasets. Experimental results demonstrate that the proposed method, as a conventional approach, not only significantly outperforms the nine conventional comparison methods but is also competitive with the seven deep learning-based comparison methods. In particular, our method achieves an accuracy of 93.4% on the Office+Caltech10 dataset, which outperforms the other comparison methods. An ablation study further demonstrates the effectiveness of the proposed constraint in aligning the objectives.

A survey on semi-supervised learning

Article Open access 15 November 2019

A survey of transfer learning

Article Open access 28 May 2016

Learning from imbalanced data: open challenges and future directions

Article Open access 22 April 2016

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

It is known that machine learning benefits from large-scale manually labeled data. However, manual labeling is often time-consuming and labor-intensive. Dealing with scenarios where manually labeled data are insufficient is a difficult task. A natural thought for this problem is to transfer the labeled to unlabeled data knowledge. However, the mismatch between the data distributions may lead to catastrophic results. To address mismatched distribution problem, much efforts have been devoted to unsupervised domain adaptation (UDA) [1,2,3,4], which transfers empirical knowledge from a label-rich source domain to an unlabeled target domain with a different distribution.

The core UDA issue is to reduce the distribution gap between the source domain and the target domain [5,6,7,8,9]. A common thought for this problem is to find a shared subspace where the source and target data distributions are similar [10,11,12,13]. To learn the subspace, a metric is first needed to evaluate the distribution difference. Common metrics include A-distance [14], KL divergence [15], and maximum mean discrepancy (MMD) [16]. Then, an optimization objective defined on the metric with continuous constraints is used to align the source and target domains. For example, [17] reduced the mean deviation of the domains by learning a linear projection matrix with MMD.

Recently, many methods have aimed to further improve the discrimination of the source and target data as an addition to simply aligning the distributions since the latter may weaken data discrimination [12, 18]. The optimization objectives of discriminative learning methods can be roughly divided into four types: reducing the intraclass distances between domains [19, 20], enlarging the interclass distances between domains [21], reducing the intraclass distances within domains [22, 23], and enlarging the interclass distances within domains [24, 25]. For example, [19] aligned the conditional distributions of the source and target domains by reducing the mean deviating between a class in the source domain and the same class in the target domain, which can be regarded as a method of reducing the intraclass distance between domains. [21] proposed a contrastive domain discrepancy to reduce the intraclass distance and enlarge the interclass distance between domains. [25] takes the maximum distance of the data pairs in the same class and the minimum distance of the data pairs in different classes as regularization terms to align the source and target domains, which can be regarded as a mixture of reducing the intraclass distance between and within domains as well as enlarging the interclass distance within domains. Each type of objective shows effectiveness in improving the discrimination of the data representations to some extent; however, to our knowledge, none of the existing methods consider all four objectives together.

Moreover, if multiple types of discriminative optimization objectives are used together, the effects of different objectives on the data representations may be dramatically different, particularly for tasks with large domain shifts. Figures 1a and 1b illustrate this problem, where Fig. 1a demonstrates the principles of multiple types of discriminative optimization objectives, and Fig. 1b shows the new data distribution generated by the objectives. From Fig. 1b, we can see that the discrimination of the data distributions is significantly improved compared to that in Fig. 1a; however, the discrimination of the source and target data is different. We call this inconsistent domain alignment the discrimination shift, which will finally lead to unsatisfied transfer performance. This problem seems insufficiently explored.

In this paper, we propose a novel UDA algorithm, named zeroth- and first-order difference discrimination (ZFOD), to address the above problems. It first optimizes all four discriminative optimization objectives jointly. They are implemented as follows: (i) The minimization of the intraclass distances between domains is implemented by minimizing the distance between the means of the same class across domains. (ii) The maximization of the interclass distance between domains is implemented by maximizing the distance between the mean of a class in the source domain and the mean of another class in the target domain. (iii) The minimization of the intraclass distance within domains is implemented by minimizing the distance of every two samples in the same class for the source and target domains. (iv) The maximization of the interclass distance within domains is implemented by maximizing the distance of every two samples from different classes of either the source domain or the target domain.

To align the components of the ZFOD objective for first-order difference discrimination of data across the source and target domains, motivated by [26], we propose a novel first-order difference constraint. As illustrated in Fig. 1c, when ZFOD maximizes the interclass distance within domains, the first-order difference constraint aims to constrain the distances in the source and target domains to be the same.

Notably, optimizing the objective of ZFOD needs to obtain pseudolabels for the target domain. However, the accumulation of pseudolabel errors during the iterative optimization process degrades the performance significantly. To alleviate this difficulty, we use a simple and effective method, named mining target domain intraclass similarity to remedy the pseudolabels (TSRP) [27] and improve accuracy. To summarize, the novelty and contribution of the paper is summarized as follows:

We propose jointly optimizing all four discriminative optimization objectives, which is the zeroth-order discrimination of ZFOD.
We propose first-order difference discrimination to align the objective components. The optimization of the discrimination term in each iteration is formulated as a generalized eigenvalue decomposition problem, which has a simple closed-form solution.
We conducted an extensive comparison with nine representative conventional methods [10, 19, 20, 24, 25, 28,29,30,31] and seven remarkable deep learning-based methods [32,33,34,35,36,37,38,39] on four benchmark datasets, including Office+Caltech10 [29], Office-31 [40], ImageCLEF-DA [41], and Office-Home [42]. Experimental results demonstrate the competitiveness of the proposed method with not only conventional comparison methods but also deep learning-based comparison methods.

The remainder of this paper is organized as follows: In Sect. “Related work”, we review some related works. In Sect. “Methods”, we present the proposed ZFOD. The experimental results are reported in Sect. “Experiments”. Finally, we conclude this paper in Sect. “Conclusion”.

Related work

In this section, we first review the methods of learning domain-invariant features and then review the methods that focus on improving data discrimination.

Many domain-invariant feature learning methods have been proposed recently [3, 43], which can be roughly divided into two categories: instance reweighting adaptation methods and feature adaptation methods [3]. Instance reweighting adaptation methods aim to allocate resampling weights directly by feature distribution matching across different domains in a nonparametric manner [22, 44, 45]. For example, [45] proposed an intuitive weighting-based subspace alignment method by reweighting the source samples, which generates a source subspace that is close to a target subspace. [22] reweighted instances by landmark selection so that the pivot samples of the landmarks can be selected as a knowledge transfer bridge, and the outliers can be filtered out. Feature adaptation methods aim to obtain a domain-invariant feature by aligning the domain data distributions [29, 33, 46,47,48,49,50]. For instance, [29] maps both domains into the Grassmann manifold and models the domain shift by constructing geodesic flows. [46] aligned the distributions of source and target domains by aligning their second-order statistics. [47] aligned the distributions indirectly by minimizing the difference between the higher-order central moments of the domains.

In recent years, the idea of generating pseudolabels for the target domain has become popular in UDA [19, 23,24,25]. However, inaccurate pseudolabels may yield unsatisfactory performance with error accumulation during the optimization process. Therefore, some methods aim to improve pseudolabel accuracy [23, 25, 51]. For example, [51] used three asymmetric classifiers to improve pseudolabel accuracy, where two of the classifiers were used to select confident pseudolabels, and the third learned a discriminative data representation for the target domain. [23] proposed generating accurate pseudolabels by selective pseudolabeling and structured prediction. [52] proposed multistage adaptive label filtering to increase correctly labeled target samples.

With the increasing usage of pseudolabels, an increasing number of methods are considering how to improve the discrimination of a data representation while maintaining its domain-invariant property. Here, we list some representative methods that transform the unsupervised target domain into a supervised target domain with pseudolabels. [24] minimizes the distance of each pair of samples in a class and maximizes the distance of any two samples that belong to different classes for the source and target domains. [5] obtained the pseudolabels of the target domain by clustering and then aligned the class centers of the source and target domains for a domain-invariant subspace. [12] proposed a supervised discriminative MMD with pseudolabels to mitigate the degradation of feature discriminability incurred by MMD. [10] developed an ensemble model by a clustering-promoting technology and obtained the final decision of unlabeled target data via majority voting. [23] utilized supervised locality-preserving projection to reduce the distances between class samples across the source and target domains. [53] proposed cross-domain contrastive learning to make samples within the same category close to each other, while samples from different classes lie far apart, regardless of which domain they come from.

The methods mentioned above only consider part of the discrimination. In this paper, we divide the UDA methods for improving data discrimination into four categories. With this observation, we propose a zeroth-order discrimination objective to further improve the discrimination of source and target data and keep the data distributions of the two domains aligned. Inspired by [26], which constrained the distance of any two samples belonging to different classes to a constant, we propose first-order difference discrimination for UDA. Unlike [26], the proposed first-order difference discrimination constrains the interclass distances of the source domain to be as equal as possible to that of the target domain, thereby mitigating the discriminative differences between the two domains.

Methods

In this section, we first present the proposed ZFOD framework in Sect. “Optimization objective”, and then describe its components, which include the interdomain discrimination, intradomain discrimination, first-order difference discrimination, regularization of the projection model, and pseudolabel generation algorithm respectively from Sect. “Interdomain discrimination” to Sect. “Improving the pseudolabels given the solution in (1)”, in detail.

Optimization objective

Assume the source domain contains $n_s$ labeled data points $\{({\textbf{x}}^{i}_s,y^i_s)\}_{i=1}^{n_s}$ and the target domain contains $n_t$ unlabeled data points {${\textbf{x}}^{j}_t\}_{j=1}^{n_t}$ where ${\textbf{x}}^i_s, {\textbf{x}}^{j}_t\in {\mathbb {R}}^{m}$ and $y^i_s\in \{1,2,\ldots ,C\}$ with m denoted as the feature dimension and C denoted as the number of classes. We denote ${\textbf{X}}_s = [{\textbf{x}}^{1}_s,\ldots , {\textbf{x}}^{n_s}_s]$ and ${\textbf{X}}_t = [{\textbf{x}}^{1}_t,\ldots , {\textbf{x}}^{n_t}_t]$. The whole data matrix is ${\textbf{X}} = [{\textbf{X}}_s,{\textbf{X}}_t]\in {\mathbb {R}}^{m\times n}$ with $n=n_s+n_t$. In this paper, we generate and optimize pseudolabels for the target domain. We denote the pseudolabel of the target data ${\textbf{x}}^j_t$ as ${\hat{y}}^j_t$. Our goal is to find a common feature subspace, defined by a projection matrix ${\textbf{P}}\in {\mathbb {R}}^{m\times d}$, such that the new feature representations of the two domains in the new subspace, i.e., ${\textbf{z}}_{s}^i={\textbf{P}}^{\top }{\textbf{x}}_{s}^i $ and ${\textbf{z}}_{t}^j={\textbf{P}}^{\top }{\textbf{x}}_{t}^j$, can be effectively aligned, where d is the feature dimension of the subspace.

The core idea of our ZFOD is (i) the zeroth-order discrimination, which consists of four discriminative optimization objectives and can be roughly divided into two categories: the interdomain discrimination ${\mathcal {L}}_{\textrm{interD}}$ and intradomain discrimination ${\mathcal {L}}_{\textrm{intraD}}$ for improving the data discrimination, and (ii) the first-order difference discrimination ${\mathcal {L}}_{\textrm{FD}}$ for aligning the interclass discrimination of the source and target domains. ZFOD is formulated as the following optimization problem:

$$\begin{aligned} \begin{aligned}&\min _{{\textbf{P}}}{\mathcal {L}}_{\textrm{interD}}+\alpha {\mathcal {L}}_{\textrm{intraD}}+\beta {\mathcal {L}}_{\textrm{FD}}+ \gamma \left\| {\textbf{P}}\right\| _{F}^{2} \end{aligned} \end{aligned}$$

(1)

where the operator $\left\| \cdot \right\| _{F}^{2}$ is the Frobenius norm, and $\left\| {\textbf{P}}\right\| _{F}^{2}$ is a regularization term of ${\textbf{P}}$ to avoid overfitting, $\alpha $, $\beta $ and $\gamma $ are three tradeoff parameters, and ${\mathcal {L}}_{\textrm{interD}}$, ${\mathcal {L}}_{\textrm{intraD}}$, and ${\mathcal {L}}_{\textrm{FD}}$ are the functions of the variables ${\textbf{P}} $ and $\{{\hat{y}}^j_t\}_{j=1}^{n_t}$. For simplicity, we fix $\alpha =1$, which should yield good performance after an empirical investigation on some benchmark datasets.

Interdomain discrimination

The interdomain discrimination objective ${\mathcal {L}}_{\textrm{interD}}$ aligns the distributions of the source samples and target samples, so that the discrimination of the source samples can be transferred to the unlabeled target samples. Given the pseudolabels for the target domain, the interdomain discrimination can be further divided into two parts: interdomain intraclass distance discrimination and interdomain interclass distance discrimination, which are described respectively as follows:

Interdomain intraclass distance discrimination

Interdomain interclass distance discrimination aligns the data distributions of each single class across the two domains as similarly as possible in the subspace, which can be considered a set of C independent UDA problems. We evaluate the interdomain divergence of a class across domains by MMD [16]. The alignment over all classes is formulated as a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}&=\sum _{k=0}^{C}\left\| \frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{k}} {\textbf{z}}_{t}^j\right\| ^{2}\\&=\sum _{k=0}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}_{k} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned}\nonumber \\ \end{aligned}$$

(2)

where ${\textbf{X}}_{s}^{k}$ denotes the source samples of class k and ${\textbf{X}}_{t}^{k}$ denotes the target samples with their pseudoclass labels set to k, and $n_{s}^{k}$ and $n_{t}^{k}$ are the number of the source and target samples of class k, respectively; particularly, the class $k=0$ denotes the data of the whole source or target domain; ${\textbf{M}}_{k}$ is the intraclass matrix of class k across domains with its element $\left( {\textbf{M}}_{k}\right) _{i j}$ defined as:

$$\begin{aligned} \left( {\textbf{M}}_{k}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{t}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

(3)

and ${\textbf{M}}^{\textrm{intraC}} = \sum _{k=0}^C{\textbf{M}}_{k}$. Note that ${\textbf{M}}_{k}$ is also known as the class-conditional MMD matrix for class k.

Interdomain interclass distance discrimination

Interdomain interclass distance discrimination improves cross-domain data discrimination such that each single class in a domain is far apart from the other classes in the other domain. The interdomain divergence is evaluated by MMD as well. Unlike the problem in Sect. “Interdomain intraclass distance discrimination”, the interdomain interclass distance discrimination is formulated as a maximization of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}}&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C}\left\| \frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{t}^{q}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{q}} {\textbf{z}}_{t}^{j}\right\| ^{2}\\&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}_{kq} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned}\nonumber \\ \end{aligned}$$

(4)

where ${\textbf{M}}_{kq}$ is the interclass matrix of class k in the source domain and pseudoclass q in the target domain:

$$\begin{aligned} \left( {\textbf{M}}_{kq}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{t}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \\ -\frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \\ -\frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q} \\ 0, &{} \text{ otherwise } \end{array}\right. }. \end{aligned}$$

(5)

and ${\textbf{M}}^{\textrm{interC}} = \sum _{k=1}^{C-1}\sum _{q=k+1}^{C}{\textbf{M}}_{kq}$.

We regard that intraclass and interclass discrimination as equivalently important. Eventually, the interdomain discrimination is formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{interD}}&={\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}- {\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \end{aligned}$$

(6)

where ${\textbf{M}}$ represents the interdomain discrimination matrix:

$$\begin{aligned} {\textbf{M}}={\textbf{M}}^{\textrm{intraC}}-{\textbf{M}}^{\textrm{interC}}. \end{aligned}$$

(7)

Intradomain discrimination

The intradomain discrimination objective ${\mathcal {L}}_{\textrm{intraD}}$ improves the data discrimination at each domain, which can be divided into two parts: intradomain intraclass distance discrimination and intradomain interclass distance discrimination.

Intradomain intraclass discrimination

Intradomain intraclass discrimination reduces the intraclass distance in each domain. Here we define the intraclass distance as the average distance of pairwise samples in a class, which results in a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{intraC}}_{\textrm{intraD}}&=\sum _{k=1}^{C} \frac{n_{s}}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i, {\textbf{x}}_{s}^j\in {\textbf{X}}_s^k}\left\| {\textbf{z}}_{s}^i-{\textbf{z}}_{s}^j\right\| ^{2}\\&\quad +\sum _{k=1}^{C} \frac{n_{t}}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^i, {\textbf{x}}_{t}^j\in {\textbf{X}}_t^k}\left\| {\textbf{z}}_{t}^i-{\textbf{z}}_{t}^j\right\| ^{2} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{s} {\textbf{W}}_{s}^{\textrm{intraC}} {\textbf{X}}_{s}^{\top } {\textbf{P}}\right) +{\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{t} {\textbf{W}}_{t}^{\textrm{intraC}} {\textbf{X}}_{t}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \end{aligned}$$

(8)

where ${\textbf{W}}_{s}^{\textrm{intraC}}$ and ${\textbf{W}}_{t}^{\textrm{intraC}}$ are the intraclass matrices of the source domain and target domain respectively:

$$\begin{aligned} \left( {\textbf{W}}_{s}^{\textrm{intraC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{s}, &{} \text{ if } i=j \\ -\frac{n_{s}}{n_{s}^{k}}, &{} \text{ if } i \ne j \\ 0, &{} \text{ otherwise }\end{array}\right. },\quad \forall {\textbf{x}}_s^i, {\textbf{x}}_s^j\in {\textbf{X}}_s^k \nonumber \\ \end{aligned}$$

(9)

$$\begin{aligned} \left( {\textbf{W}}_{t}^{\textrm{intraC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{t}, &{} \text{ if } i=j \\ -\frac{n_{t}}{n_{t}^{k}}, &{} \text{ if } i \ne j \\ 0, &{} \text{ otherwise } \end{array}\right. },\quad \forall {\textbf{x}}_t^i, {\textbf{x}}_t^j\in {\textbf{X}}_t^k \nonumber \\ \end{aligned}$$

(10)

and ${\textbf{W}}^{\textrm{intraC}}={\text {diag}}\left( {\textbf{W}}_{s}^{\textrm{intraC}},{\textbf{W}}_{t}^{\textrm{intraC}}\right) $.

Intradomain interclass discrimination

Intradomain interclass discrimination enlarges the interclass distance in each domain. Here, we define the interclass distance as the average of the distances of any pair of samples that belong to different classes in a domain, which results in a maximization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{interC}}_{\textrm{intraD}}&=\sum _{y_{s}^i \ne y_{s}^j}\left\| {\textbf{z}}_{s}^i-{\textbf{z}}_{s}^ j\right\| ^{2}+\sum _{{\hat{y}}_{t}^i \ne {\hat{y}}_{t}^j}\left\| {\textbf{z}}_{t}^i-{\textbf{z}}_{t}^j\right\| ^{2} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{s} {\textbf{W}}_{s}^{\textrm{interC}} {\textbf{X}}_{s}^{\top } {\textbf{P}}\right) +{\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{t} {\textbf{W}}_{t}^{\textrm{interC}} {\textbf{X}}_{t}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$

(11)

where ${\textbf{W}}_{s}^{\textrm{interC}}$ and ${\textbf{W}}_{t}^{\textrm{interC}}$ are the interclass matrices of the source domain and target domain, respectively:

$$\begin{aligned} \left( {\textbf{W}}_{s}^{\textrm{interC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{s}-n_{s}^{k}, &{} \text{ if } i=j \text { and } y_{s}^i=k \\ -1, &{} \text{ if } i \ne j \text { and } y_{s}^i \ne y_{s}^j \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

(12)

$$\begin{aligned} \left( {\textbf{W}}_{t}^{\textrm{interC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{t}-n_{t}^{k}, &{} \text{ if } i=j \text { and } {\hat{y}}_{t}^i=k \\ -1, &{} \text{ if } i \ne j \text { and } {\hat{y}}_{t}^i \ne {\hat{y}}_{t}^j \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

(13)

and ${\textbf{W}}^{\textrm{interC}}={\text {diag}}\left( {\textbf{W}}_{s}^{\textrm{interC}},{\textbf{W}}_{t}^{\textrm{interC}}\right) $.

Finally, we formulate the intradomain discrimination from (8) and (11) as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{intraD}}&={\mathcal {L}}^{\textrm{intraC}}_{\textrm{intraD}}-\rho {\mathcal {L}}^{\textrm{interC}}_{\textrm{intraD}}\\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) -\rho {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$

(14)

where $\rho $ is a hyperparameter for balancing the two terms, and ${\textbf{W}}$ is the intradomain discrimination matrix:

$$\begin{aligned} {\textbf{W}}={\textbf{W}}^{\textrm{intraC}}-\rho {\textbf{W}}^{\textrm{interC}} \end{aligned}$$

(15)

First-order difference discrimination

Definition

The first-order difference discrimination objective ${\mathcal {L}}_{FD}$ constrains the distance between any two classes of a domain to be similar to the distance of the same pair of classes of another domain, which is formulated as a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{FD}}&= \sum _{k=1}^{C-1}\sum _{q=k+1}^{C}\left\| {\frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{s}^{q}} \sum _{{\textbf{x}}_{s}^j \in {\textbf{X}}_{s}^{q}} {\textbf{z}}_{s}^j}\right. \\&\quad \left. {-\frac{1}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^i \in {\textbf{X}}_{t}^{k}} {\textbf{z}}_{t}^i+\frac{1}{n_{t}^{q}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{q}} {\textbf{z}}_{t}^j}\right\| ^{2}\\&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{S}}_{kq} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{S}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$

(16)

where

$$\begin{aligned} \left( {\textbf{S}}_{kq}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{s}^{q} n_{s}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q} \\ \frac{1}{n_{t}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \\ \frac{1}{n_{t}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in \hat{{\textbf{X}}}_{t}^{q} \\ \frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ \frac{1}{n_{s}^{q} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{k} n_{s}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}\\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ -\frac{1}{n_{t}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ 0, &{} \text{ otherwise } \end{array}\right. }. \nonumber \\ \end{aligned}$$

(17)

and ${\textbf{S}}$ represents the first-order difference discrimination matrix:

$$\begin{aligned} {\textbf{S}}=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C}{\textbf{S}}_{kq} \end{aligned}$$

(18)

Discussion

The first-order difference discrimination ${\mathcal {L}}_{\textrm{FD}}$ is designed to remedy the weakness of the interdomain discrimination ${\mathcal {L}}_{\textrm{interD}}$.

Specifically, when the data distributions are nonuniform, the large intraclass distance components in ${\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}$ and small interclass distance components of ${\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}}$ contribute to the main reduction of the objective value $\min _{{\textbf{P}}} {\mathcal {L}}_{\textrm{interD}}$ during the optimization process, which makes the solution ${\textbf{P}}$ biased toward these components, and against the other components. As a result, the classes that do not benefit much from $\min _{{\textbf{P}}} {\mathcal {L}}_{\textrm{interD}}$ tend to be not aligned well across domains. In other words, the knowledge of those classes in the source domain may not transfer well to the target domain.

If we calculate the components of ${\mathcal {L}}_{\textrm{FD}}$, we find that the classes that are not well aligned tend to make the values of the corresponding components large, while the classes that are well aligned yield small values. Therefore, we propose ${\mathcal {L}}_{\textrm{FD}}$ as a supplement to ${\mathcal {L}}_{\textrm{interD}}$.

Optimization algorithm

The optimization of ZFOD alternately conducts the following steps: (i) solving problem (1) assigning the pseudolabels and (ii) improving the pseudolabels with the solution of (1). See the following subsections for the two steps. See Algorithm 1 for a summarization of ZFOD.

Solving problem (1) given the pseudolabels

Substituting (6), (14), and (16) into (1) results in the following optimization problem:

$$\begin{aligned} \begin{array}{cl} \min \limits _{{\textbf{P}}} &{} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top } {\textbf{P}}\right) +\gamma \left\| {\textbf{P}}\right\| _F^2 \\ \text{ s.t. } &{} {\textbf{P}}^{\top } \textbf{X H} {\textbf{X}}^{\top } {\textbf{P}}={\textbf{I}}_{d} \end{array} \end{aligned}$$

(19)

where the constraint maximizes the embedded data variance as [24] and [19] did, ${\textbf{I}}_{d}$ is an identify matrix of dimension d and ${\textbf{H}}={\textbf{I}}_{(n_s+n_t)}-(1/(n_s+n_t)){\textbf{1}}_{(n_s+n_t)\times (n_s+n_t)}$.

The Lagrangian function of problem (19) is:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}({\textbf{P}}, \Theta )=&{\text {Tr}}\left( {\textbf{P}}^{\top }\left( {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top }+\gamma {\textbf{I}}_{m}\right) {\textbf{P}}\right) \\&+{\text {Tr}}\left( \left( {\textbf{I}}_{d}-{\textbf{P}}^{\top } {\textbf{X}} {\textbf{H}} {\textbf{X}}^{\top } {\textbf{P}}\right) \Theta \right) \end{aligned} \end{aligned}$$

(20)

where $\Theta ={\text {diag}}(\theta _1,\theta _2,...,\theta _d) \in {\mathbb {R}}^{d\times d}$ is a diagonal matrix with Lagrange multipliers. Solving $\partial {\mathcal {L}}({\textbf{P}}, \Theta )/\partial {\textbf{P}} = 0$ derives the following optimal solution of (19):

$$\begin{aligned} {\textbf{P}}^{\star }=\left( {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top }+\gamma {\textbf{I}}_{m}\right) ^{-1}\textbf{X H} {\textbf{X}}^{\top } {\textbf{P}}\Theta \end{aligned}$$

(21)

which is a generalized eigenvalue decomposition problem. In practice, we select the generalized eigenvectors of the right side of (21) corresponding to the d-smallest eigenvalues as the final ${\textbf{P}}^{\star }$.

Note that problem (19) can be easily generalized into the following kernel form through a kernel mapping $\phi : {\textbf{X}} \rightarrow \phi ({\textbf{X}})$:

$$\begin{aligned} \begin{array}{cl} \min \limits _{{\textbf{P}}} &{} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{K}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{K}}^{\top } {\textbf{P}}\right) +\gamma \left\| {\textbf{P}}\right\| _F^2 \\ \text{ s.t. } &{} {\textbf{P}}^{\top } \textbf{K H} {\textbf{K}}^{\top } {\textbf{P}}={\textbf{I}}_{d} \end{array} \end{aligned}$$

(22)

where ${\textbf{K}}=\phi ({\textbf{X}})^{\top } \phi ({\textbf{X}})$. It has a similar solution to (21).

Improving the pseudolabels given the solution in (1)

Inaccurate pseudolabels of the target domain result in suboptimal performance. To alleviate this problem, we use our previous work, named TSRP [27], to correct the errors of the pseudolabels. For the integrity of the paper, we present the TSRP algorithm briefly as follows:

As shown in Fig. 2, the TSRP first generates coarse pseudolabels for the target domain by a classifier $f_{\textrm{init}}(\cdot )$ trained on the source samples $\{({\textbf{z}}^{i}_s,y^i_s)\}_{i=1}^{n_s}$, and calculates the similarity of the samples in each pseudoclass as in Fig. 2a. Then, for each pseudoclass, it removes the samples with low confidence, as in Fig. 2a, b, where the confidence is defined as

$$\begin{aligned} S_{i,j}^{k}= \left\{ \begin{array}{ll} \frac{\langle {\textbf{z}}_{t}^{i}, {\textbf{z}}_{t}^{j}\rangle }{\Vert {\textbf{z}}_{t}^{i}\Vert \Vert {\textbf{z}}_{t}^{j}\Vert }, &{} i\ne j, {\hat{y}}_{t}^i = {\hat{y}}_{t}^j = k\\ 0, &{} \text {otherwise} \\ \end{array}\right. , \forall k=1,2,\ldots ,C \nonumber \\ \end{aligned}$$

(23)

Next, as shown as in the operation step from Fig. 2b–c, for each pseudoclass, TSRP connects the samples by spanning trees and then selects the samples of a spanning tree whose root sample has the maximum degree. In this way, the TSRP can further eliminate the negative effect of highly confident misclassified samples. Finally, it uses the selected target-domain samples with high confidence and the source-domain samples together to train a final classifier $f_{\textrm{final}}(\cdot )$, which is used to generate the refined pseudolabels of the remaining samples with low confidence.

Computational complexity

Assume ZFOD needs T iterations to converge; then the overall computational complexity of ZFOD is ${\mathcal {O}}(T((C^2+4)(n_s+n_t)^2+dm^2))$, which is proved as follows:

As shown in Algorithm 1, for each iteration, the computational complexity of ZFOD mainly consists of the following five parts:

Calculating the interdomain discrimination matrix ${\textbf{M}}$ spends ${\mathcal {O}}(0.5(C^2+C+2)(n_s+n_t)^2)$ time.
Calculating the intradomain discrimination matrix ${\textbf{W}}$ spends ${\mathcal {O}}(2(n_s+n_t)^2)$ time.
Calculating the first-order difference discrimination matrix ${\textbf{S}}$ spends ${\mathcal {O}}(0.5(C^2-C)(n_s+n_t)^2)$ time.
Solving the generalized eigen-decomposition problem takes ${\mathcal {O}}(dm^2)$ time.
TSRP costs ${\mathcal {O}}((n_s+n_t)^2)$.

As the experiments show, because T and the optimal d are usually small numbers, ZFOD can be solved in a polynomial time with respect to the number of data samples.

Storage complexity

The overall storage complexity of ZFOD is ${\mathcal {O}}((n_s+n_t)^2)$, which is proved as follows:

Storing the source and target data, which require an ${\mathcal {O}}((n_s+n_t))$ space.
Calculating the interdomain discrimination matrix ${\textbf{M}}$, which require an ${\mathcal {O}}((n_s+n_t)^2)$ space.
Calculating the intradomain discrimination matrix ${\textbf{W}}$, which require an ${\mathcal {O}}(n_s^2+n_t^2)$ space.
Calculating the first-order difference discrimination matrix ${\textbf{S}}$, which require an ${\mathcal {O}}((n_s+n_t)^2)$ space.

Table 1 Computational and storage complexities of the comparison methods

Full size table

Experiments

In this section, we evaluate the performance of the proposed ZFOD on several popular visual cross-domain benchmarks. The source code of ZFOD is available at https://github.com/02Bigboy/ZFOD.

Datasets and cross-domain tasks

The experiments were performed on 4 datasets. The detailed information of the datasets are introduced as follows:

Office+Caltech10 [29] contains four domains: Amazon, Webcam, DSLR and Caltech-256, which share the same 10 classes. The dataset has 2533 images in total.

Office-31 [40] consists of three domains: Amazon (A), Webcam (W) and DSLR (D). It contains 4110 images with 31 categories in total.

ImageCLEF-DA [41] has three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). It contains 12 classes, each of which has 50 images from three domains.

Office-Home [42] includes 65 object classes from four domains, i.e., Artistic images (A), Clipart (C), Product images (P) and Real-World images (R). There are a total of 15,588 images.

Table 2 Average classification accuracy (%) of the comparison methods on the target domains of the Office+Caltech10 dataset, where A = Amazon, C = Caltech, D = DSLR And W = Webcam. The highest accuracy on a cross-domain task is marked in bold

Full size table

For the Office+Caltech10 dataset, we used the 4096-dimensional DECAF-6 feature [54]. For the other datasets, we used the 2048-dimensional ResNet-50 feature [32].

Experimental settings

For the proposed ZFOD, we applied the following parameter setting to all comparison experiments. Specifically, we set the hyperparameters $\beta =1.5$, $\gamma =1$, and $\rho =0.1$. We set the dimension of the subspace d to 100, and limited the number of the optimization iterations to be no larger than 10, i.e., $N=10$. As the ablation study shows, ZFOD is insensitive to the hyperparameter setting.

To evaluate the effectiveness of the proposed algorithm, nine representative conventional methods and seven remarkable deep learning-based methods were compared.

Conventional UDA methods

we compare nine conventional UDA methods with ZFOD:

Classic UDA methods:
1. 1.
  1-nearest neighbor classifier (1NN) [28].
2. 2.
  Geodesic flow kernel for domain adaptation (GFK) [29].
3. 3.
  Joint geometrical and statistical alignment (JGSA) [30].
4. 4.
  Manifold embedded distribution alignment (MEDA) [20].
Method of reducing the intraclass distance between domains:
1. 1.
  Joint distribution adaptation (JDA) [19].
Method of reducing the intraclass distance between domains and reducing the intraclass distance within domains.
1. 1.
  Domain-irrelevant class clustering (DICE) [10].
Method of reducing the intraclass distance and enlarging the interclass distances within domains.
1. 1.
  Minimum centroid shift (MCS) [31].
Methods of reducing the intraclass distance between domains, enlarging the interclass distances within domains and reducing the intraclass distances within domains.
1. 1.
  Discriminatve transfer feature and label consistency (DTLC) [25].
2. 2.
  Domain invariant and class discriminative (DICD) [24].
3. 3.
  Target similarity for pseudolabels (TSRP) [27].

Note that the proposed ZFOD is closely related to DICD. When ZFOD removes the interclass distance across domains and first-order difference discrimination, ZFOD degrades into DICD.

We list the computational and storage complexities of the proposed methods and some comparison methods in Table 1. The table shows that when the number of classes (C) is small, the complexity of the proposed method is similar to those of the comparison methods.

Deep learning-based UDA methods

They are the deep adaptation networks (DAN) [33], residual transfer networks (RTN) [34], multiadversarial DA (MADA) [35], conditional domain adversarial network(CDAN) [36], joint adaptation networks (JAN) [37], Collaborative and adversarial network (iCAN) [38], and maximum classifier discrepancy (MCD) [39], respectively. Note that we also added Resnet-50 [32], which use the classifier trained on the source domain to the test domain directly without applying a specially designed UDA algorithm as a baseline.

Table 3 Average classification accuracy (%) of the comparison methods on the target domains of the Office-31 and ImageCLEF-DA datasets

Full size table

Table 4 Average classification accuracy (%) of the comparison methods on the target domains of the Office-Home dataset

Full size table

The classification accuracy on the target domain is used as the evaluation metric. For a fair comparison, the reported results of the comparison methods were either from their original papers or produced by the publicly available codes of the methods.

Main results

Table 2 lists the classification accuracy of the comparison methods on the domain adaptation tasks of the Office+ Caltech10 dataset. The table shows that ZFOD yields the highest average accuracy, which achieves an absolute improvement of $0.4\%$ over the best competitor DTLC. For the results on each domain adaptation task, ZFOD achieves the highest accuracy on 6 out of 12 tasks.

In particular, the average accuracy of ZFOD is $3.5\%$ higher than that of the closely related comparison method DICD. If we look at the tasks in detail, we find that ZFOD is superior to DICD for all tasks except for “D$\rightarrow $W”. For example, the accuracy of ZFOD over DICD is improved from $83.4\%$ to $95.5\%$ on the “A$\rightarrow $D” task, and from $93.6\%$ to $98.7\%$ on the “C$\rightarrow $D” task.

Table 3 shows the classification accuracy of the comparison methods on the Office-31 and ImageCLEF-DA datasets. From the table, we see that ZFOD reaches an average accuracy of $88.1\%$, which is $0.2\%$ higher than the runner-up comparison method DTLC. It achieves the best results on 3 out of 12 tasks. It is worthy mentioning that ZFOD is superior to DICD in all tasks, and reaches an average accuracy of $3.5\%$ higher than that of DICD. Especially for the “A$\rightarrow $D” task of the Office-31 dataset and the “P$\rightarrow $I” task of the ImageCLEF-DA dataset, the accuracy has been increased from 81.7$\%$ and 81.2$\%$ to 91.0$\%$ and 91.5$\%$, respectively. Besides, ZFOD outperforms the deep learning-based methods. Its average accuracy is $0.8\%$ higher than the best deep model iCAN.

Table 4 lists the classification accuracy of the comparison methods on the Office-home dataset. From the table, we observe similar phenomena as those on the other datasets. Specifically, ZFOD reaches the highest average accuracy, which is $0.8\%$ higher than the best referenced method EeasyTL. It achieves the highest accuracy on 3 out of 12 tasks, while EeasyTL wins none of the tasks. Compared to the deep learning-based methods, ZFOD outperforms all deep methods with an absolute accuracy improvement of at least $1.6\%$. For example, ZFOD outperforms CDAN in 7 out of 12 tasks. The accuracy is improved from 66.0$\%$ to 70.8$\%$ on the “Cl$\rightarrow $Pr” task, and from 55.6$\%$ to 62.5$\%$ on the “Pr$\rightarrow $Ar” task.

Table 5 Properties of the comparison methods

Full size table

The above phenomena, which demonstrate the advantage of ZFOD, can be explained as follows: ZFOD not only maximizes the data discrimination in four aspects but also aligns the data distributions of the two domains by the so-called first-order difference discrimination, which may result in better domain-invariant and discriminative representations than the comparison methods. In contrast, some excellent comparison methods, such as MCS, DTLC, and DICD, only maximize part of the data discrimination. Their interdomain alignment is also limited to the data and do not refer to the levels of high-order data discrimination. See Table 5 for a summary of the differences between ZFOD and some representative conventional methods that also optimize the data discrimination to some extent.

Ablation study

ZFOD contains three novel points: TSRP-based pseudolabel generation, first-order difference discrimination, and zeroth-order discrimination. In this subsection, we conducted an ablation study by removing the novel points from ZFOD one by one. The ZFOD without the TSRP-based pseudolabel generation method is denoted as “ZFOD without TSRP”. The ZFOD without the first two novel points is denoted as “ZFOD without TSRP and FOD”. If we remove all three novel points, the most closely related method is DICD [24]. When the zeroth-order discrimination removes the interclass distance across domains, then the remaining three discrimination subitems are the same type of subitems as those in DICD. Therefore, we compared the above ZFOD variants with DICD.

The experiments were conducted on the four UDA datasets. For each dataset, we randomly selected two tasks as representatives. The comparison results are listed in Table 6 and analyzed as follows:

Effect of the TSRP-based pseudolabel generation method on performance

From the comparison results between “ZFOD without TSRP” and ZFOD, we see that the performance of “ZFOD without TSRP” is apparently worse than ZFOD, which indicates that the TSRP-based pseudolabel generation method can improve the accuracy of the pseudolabels, and the correctness of the pseudolabels is very important for ZFOD to learn a domain-invariant and discriminative feature.

Effect of first-order discrimination on performance

From the comparison results between “ZFOD without TSRP” and “ZFOD without TSRP and FOD”, we see that the former shows slightly better or at least similar performance with the latter on the tasks of Office+Caltech10, Office-31 and ImageCLEF-DA and significantly outperforms the latter on the tasks of Office-home, which shows the effectiveness of the first-order discrimination and that the first-order discrimination can further help align the distribution of the source domain and target domain.

Effect of zeroth-order discrimination on performance

From the comparison results between “ZFOD without TSRP and FOD” and DICD, we find that the zeroth-order discrimination, which has an extra interdomain interclass distance discrimination subitem beyond DICD, outperforms DICD on Office+Caltech10, Office-31 and ImageCLEF-DA. However, it is significantly worse than DICD on Office-Home, which indicates that the advantage of combining all four subitems of the zeroth-order discrimination over the subset of the subitems is not guaranteed due to the discrimination inconsistency problem analyzed in Sect. “Discussion”.

After comparing the results between “ZFOD without TSRP and FOD” and DICD, we find that there is no guarantee that “ZFOD without TSRP and FOD” is always better than DICD due to the discrimination inconsistency problem. However, we further find that “ZFOD without TSRP” outperforms DICD apparently in all tasks, which indicates that the first-order discrimination overcomes the discrimination inconsistency problem of the zeroth-order discrimination.

Table 6 Ablation study for the components of ZFOD

Full size table

Effects of hyperparameters on performance

ZFOD has four hyperparameters $\beta $, $\gamma $, $\rho $, and d. In the previous sections, we used the same hyperparameter setting in all experiments. To study whether the performance of ZFOD is sensitive to different hyperparameter settings, we conducted a grid search for each parameter. During the grid search of one parameter, we fixed the other parameters to their default values. The experiments were conducted on randomly selected tasks of the four UDA datasets: “C $\rightarrow $ W (Office-Caltech10)”, “P $\rightarrow $ I (ImageCLEF-DA)”, “A $\rightarrow $ W (Office-31)” and “Ar $\rightarrow $ Pr (Office-Home)”. The results are shown in Fig. 3 and analyzed as follows:

The hyperparameter $\beta $ is a hyperparameter for the first-order discrimination. The larger $\beta $ is, the more important the first-order discrimination behaves in ZFOD. We investigated $\beta $ in a wide range of [0, 3]. From the result in Fig. 3, we can see that, when we increase $\beta $ gradually from 0 to 3, the performance of ZFOD improves steadily, especially on the large dataset Office-home. This phenomenon proves that the first-order difference discrimination facilitates the alignment of the source and target domains. Because ZFOD achieves relatively stable results when $\beta \in [0.5,1.5]$, we chose $\beta =1.5$ as the default value.

The hyperparameter $\gamma $ balances the discrimination loss functions and the complexity of the projection matrix ${\textbf{P}}$. We studied $\beta $ by a grid search of $\{0.1, 0.5, 1, 5, 10, 50, 100\}$. The figure shows that all accuracy curves tend to first rise up and then move down when $\gamma $ is gradually increased. Eventually, we pick $\gamma =1$, which tends to yield the best performance for all tasks.

The hyperparameter d defines the dimension of the subspace. We searched d in a wide range of [10, 200]. The figure shows that with the increasing d, the accuracy curves first gradually rise and then tend to be stable. Finally, we chose $d = 100$, which balances the classification performance and computational complexity.

The hyperparameter $\rho $ in (15) balances the intradomain intraclass discrimination and intradomain interclass discrimination. We studied $\rho $ by a grid search of $\{0.01, 0.05, 0.1, 0.5, 1, 2, 5\}$. The figure shows that ZFOD performs steadily when $\rho \in [0.01, 0.5]$, and drops sharply on the task of “Ar $\rightarrow $ Pr (Office-Home)” when $\rho >0.5$. Therefore, we set $\rho =0.1$ as a safe default value.

From the above analysis, we conclude that ZFOD is insensitive to the hyperparameter selection. Although the performance of ZFOD can be further improved beyond the results of the previous experimental subsections by carefully selecting a hyperparameter setting per task, we did not do so for the real-world applications of ZFOD.

Data visualization

In this subsection, we visualize the distributions and pairwise similarity matrices of data produced by DICD and ZFOD on two randomly selected tasks, which are the “A $\rightarrow $ D (Office-Caltech10)” and “W $\rightarrow $ C (Office-Caltech10)” tasks.

Figures 4 and 5 visualize the data distributions. From the two figures, we see that the data distributions produced by ZFOD have larger interclass distances than those produced by DICD.

Figures 6 and 7 show the similarity matrices of all data across domains, where the similarity between samples is measured by the cosine similarity. A similarity matrix across domains contains three parts: a source-domain similarity matrix which is at the upper-left corner of the full similarity matrix, a target-domain similarity matrix which is at the lower-right corner, and two cross-domain similarity matrices which are at the upper right and lower left corners.

Figures 6a and 7a show that the sample similarity of the cross-domain similarity matrices produced by DICD is quite high for some different classes, which indicates that DICD does not focus enough on domain alignment. In contrast, Figs. 6b and 7b show that ZFOD alleviates this problem, which indicates the effectiveness of the first-order discrimination.

Moreover, Figs. 6a and 7a show that the sample similarity produced by DICD in the source- and target-domain similarity matrices is high for the samples in the same class and not discriminative enough for the samples in different classes. In contrast, Figs. 6b and 7b show that ZFOD not only yields high intraclass similarity as DICD does, but also produces lower interclass similarity than DICD, which indicates the advantage of the zeroth-order discrimination.

Conclusion

In this paper, we proposed the zeroth- and first-order difference discrimination algorithm for unsupervised domain adaptation. It contains three novel components: zeroth-order discrimination, first-order difference discrimination, and TSRP-based pseudolabel generation. The zeroth-order discrimination consists of interdomain discrimination and intradomain discrimination, each of which is further divided into interclass discrimination and intraclass discrimination. The novelty of zeroth-order discrimination is that it covers four important aspects of data discrimination that have not been considered in the literature to our knowledge. Because the interdomain discrimination only maximizes the cross-domain discrimination at the data level without aligning the interclass distances of the source and target domains, the first-order difference discrimination was proposed to overcome the weakness. Because all of the discrimination terms use pseudolabels to define the pseudoclasses at the target domain, it is important to generate highly accurate pseudolabels. Eventually, the TSRP-based pseudolabel generation method is applied. Its core idea is to iteratively pick the pseudolabels with high confidence to train a strong classifier, which is then used to correct the remaining pseudolabels or improve their confidence. We conducted an extensive comparison with nine state-of-the-art conventional UDA methods and seven representative deep learning-based UDA methods. The comparison results demonstrate the effectiveness of the proposed method. The ablation studies further confirm the effectiveness of each novel component of ZFOD, e.g., in improving the discrimination of the domain-invariant feature and aligning the source and domains well.

References

Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359. https://doi.org/10.1109/TKDE.2009.191
Article Google Scholar
Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q (2020) A comprehensive survey on transfer learning. Proc. IEEE 109(1):43–76. https://doi.org/10.1109/JPROC.2020.3004555
Article Google Scholar
Zhang L, Gao X (2022) Transfer adaptation learning: a decade survey. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3183326
Article Google Scholar
Tao H, Qiu J, Chen Y, Stojanovic V, Cheng L (2023) Unsupervised cross-domain rolling bearing fault diagnosis based on time-frequency information fusion. J Franklin Inst 360(2):1454–1477. https://doi.org/10.1016/j.jfranklin.2022.11.004
Article Google Scholar
Tian L, Tang Y, Hu L, Ren Z, Zhang W (2020) Domain adaptation by class centroid matching and local manifold self-learning. IEEE Trans Image Process 29:9703–9718. https://doi.org/10.1109/TIP.2020.3031220
Article MathSciNet Google Scholar
Long M, Wang J, Ding G, Pan SJ, Philip SY (2013) Adaptation regularization: A general framework for transfer learning. IEEE Trans Knowl Data Eng 26(5):1076–1089. https://doi.org/10.1109/TKDE.2013.111
Article Google Scholar
Tang H, Wang Y, Jia K (2022) Unsupervised domain adaptation via distilled discriminative clustering. Pattern Recognition 127:108638. https://doi.org/10.1016/j.patcog.2022.108638
Article Google Scholar
Yang H, He H, Zhang W, Bai Y, Li T (2022) Lie group manifold analysis: an unsupervised domain adaptation approach for image classification. Appl Intell 52(4):4074–4088. https://doi.org/10.1007/s10489-021-02564-3
Article Google Scholar
Zhou C, Tao H, Chen Y, Stojanovic V, Paszke W (2022) Robust point-to-point iterative learning control for constrained systems: A minimum energy approach. Int J Robust Nonlinear Control 32(18):10139–10161. https://doi.org/10.1002/rnc.6354
Article MathSciNet Google Scholar
Liang J, He R, Sun Z, Tan T (2018) Aggregating randomized clustering-promoting invariant projections for domain adaptation. IEEE Trans Pattern Anal Mach Intell 41(5):1027–1042. https://doi.org/10.1109/TPAMI.2018.2832198
Article Google Scholar
Pilanci M, Vural E (2020) Domain adaptation on graphs by learning aligned graph bases. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2020.2984212
Article Google Scholar
Wang W, Li H, Ding Z, Nie F, Chen J, Dong X, Wang Z (2021) Rethinking maximum mean discrepancy for visual domain adaptation. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2021.3093468
Article Google Scholar
Zhuang Z, Tao H, Chen Y, Stojanovic V, Paszke W (2022) An optimal iterative learning control approach for linear systems with nonuniform trial lengths under input constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems. https://doi.org/10.1109/TSMC.2022.3225381
Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW (2010) A theory of learning from different domains. Mach Learn 79(1):151–175. https://doi.org/10.1007/s10994-009-5152-4
Article MathSciNet Google Scholar
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Article MathSciNet Google Scholar
Gretton A, Borgwardt K, Rasch M, Schölkopf B, Smola A (2006) A kernel method for the two-sample-problem. Advances in neural information processing systems 19
Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281
Article Google Scholar
Chen Y, Song S, Li S, Wu C (2019) A graph embedding framework for maximum mean discrepancy-based domain adaptation algorithms. IEEE Trans Image Process 29:199–213. https://doi.org/10.1109/TIP.2019.2928630
Article MathSciNet Google Scholar
Long M, Wang J, Ding G, Sun J, Yu PS (2013) Transfer feature learning with joint distribution adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207
Wang J, Feng W, Chen Y, Yu H, Huang M, Yu PS (2018) Visual domain adaptation with manifold embedded distribution alignment. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 402–410
Kang G, Jiang L, Yang Y, Hauptmann AG (2019) Contrastive adaptation network for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4893–4902
Li J, Jing M, Lu K, Zhu L, Shen HT (2019) Locality preserving joint transfer for domain adaptation. IEEE Trans Image Process 28(12):6103–6115. https://doi.org/10.1109/TIP.2019.2924174
Article MathSciNet Google Scholar
Wang Q, Breckon T (2020) Unsupervised domain adaptation via structured prediction based selective pseudo-labeling. Proc AAAI Conf Artificial Intell 34:6243–6250. https://doi.org/10.1609/aaai.v34i04.6091
Article Google Scholar
Li S, Song S, Huang G, Ding Z, Wu C (2018) Domain invariant and class discriminative feature learning for visual domain adaptation. IEEE Trans Image Process 27(9):4260–4273. https://doi.org/10.1109/TIP.2018.2839528
Article MathSciNet Google Scholar
Li S, Liu CH, Su L, Xie B, Ding Z, Chen CP, Wu D (2020) Discriminative transfer feature and label consistency for cross-domain image classification. IEEE Trans Neural Netw Learn Syst 31(11):4842–4856. https://doi.org/10.1109/TNNLS.2019.2958152
Article MathSciNet Google Scholar
Yang B, Yuen PC (2019) Cross-domain visual representations via unsupervised graph alignment. Proc AAAI Conf Artificial Intell 33:5613–5620
Google Scholar
Wang J, Zhang X-L (2023) Improving pseudo labels with intra-class similarity for unsupervised domain adaptation. Pattern Recognition, 109379 . https://doi.org/10.1016/j.patcog.2023.109379
Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 100(7):750–753
Article Google Scholar
Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. IEEE
Zhang J, Li W, Ogunbona P (2017) Joint geometrical and statistical alignment for visual domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1859–1867
Liang J, He R, Sun Z, Tan T (2019) Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2975–2984
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. PMLR
Long M, Zhu H, Wang J, Jordan MI (2016) Unsupervised domain adaptation with residual transfer networks. Advances in neural information processing systems 29
Pei Z, Cao Z, Long M, Wang J (2018) Multi-adversarial domain adaptation. In: Thirty-second AAAI Conference on Artificial Intelligence
Long M, Cao Z, Wang J, Jordan MI (2018) Conditional adversarial domain adaptation. Advances in neural information processing systems 31
Long M, Zhu H, Wang J, Jordan MI (2017) Deep transfer learning with joint adaptation networks. In: International Conference on Machine Learning, pp. 2208–2217. PMLR
Zhang W, Ouyang W, Li W, Xu D (2018) Collaborative and adversarial network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3801–3809
Saito K, Watanabe K, Ushiku Y, Harada T (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732
Saenko K, Kulis B, Fritz M, Darrell T (2010) Adapting visual category models to new domains. In: European Conference on Computer Vision, pp. 213–226. https://doi.org/10.1007/978-3-642-15561-1_16. Springer
Caputo B, Müller H, Martinez-Gomez J, Villegas M, Acar B, Patricia N, Marvasti N, Üsküdarlı S, Paredes R, Cazorla M, et al. (2014) Imageclef 2014: Overview and analysis of the results. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 192–211. Springer
Venkateswara H, Eusebio J, Chakraborty S, Panchanathan S (2017) Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027
Oza P, Sindagi VA, Sharmini VV, Patel VM (2023) Unsupervised domain adaptation of object detectors: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence
Huang J, Gretton A, Borgwardt K, Schölkopf B, Smola A (2006) Correcting sample selection bias by unlabeled data. Advances in neural information processing systems 19
Chen S, Zhou F, Liao Q (2016) Visual domain adaptation using weighted subspace alignment. In: 2016 Visual Communications and Image Processing (VCIP), pp. 1–4. https://doi.org/10.1109/VCIP.2016.7805516. IEEE
Sun B, Feng J, Saenko K (2016) Return of frustratingly easy domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30
Zellinger W, Grubinger T, Lughofer E, Natschläger T, Saminger-Platz S (2017) Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811
Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474
Du Z, Li J, Su H, Zhu L, Lu K (2021) Cross-domain gradient discrepancy minimization for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3937–3946
Xu Y, Cao H, Mao K, Chen Z, Xie L, Yang J (2022) Aligning correlation information for domain adaptation in action recognition. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3212909
Article Google Scholar
Saito K, Ushiku Y, Harada T (2017) Asymmetric tri-training for unsupervised domain adaptation. In: International Conference on Machine Learning, pp. 2988–2997. PMLR
Du Y, Zhou D, Xie Y, Lei Y, Shi J (2023) Prototype-guided feature learning for unsupervised domain adaptation. Pattern Recognit 135:109154. https://doi.org/10.1016/j.patcog.2022.109154
Article Google Scholar
Wang R, Wu Z, Weng Z, Chen J, Qi G-J, Jiang Y-G (2022) Cross-domain contrastive learning for unsupervised domain adaptation. IEEE Trans Multimedia. https://doi.org/10.1109/TMM.2022.3146744
Article Google Scholar
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) Decaf: A deep convolutional activation feature for generic visual recognition. In: International Conference on Machine Learning, pp. 647–655. PMLR

Download references

Acknowledgements

This work was supported in part by the National Science Foundation of China (NSFC) under Grant 62176211, and in part by the Project of the Science, Technology, and Innovation Commission of Shenzhen Municipality, China under Grant JCYJ20210324143006016 and JSGG20210802152546026.

Author information

Authors and Affiliations

School of Marine Science and Technology, Northwestern Polytechnical University, 127 Youyi West Road, Xi’an, 710072, Shaanxi, China
Jie Wang, Xing Chen & Xiao-Lei Zhang
Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, 518063, Guangdong, China
Jie Wang, Xing Chen & Xiao-Lei Zhang

Authors

Jie Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xing Chen
View author publications
You can also search for this author in PubMed Google Scholar
Xiao-Lei Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiao-Lei Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, J., Chen, X. & Zhang, XL. Zeroth- and first-order difference discrimination for unsupervised domain adaptation. Complex Intell. Syst. 10, 2569–2584 (2024). https://doi.org/10.1007/s40747-023-01283-1

Download citation

Received: 02 May 2023
Accepted: 04 November 2023
Published: 05 December 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s40747-023-01283-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Zeroth- and first-order difference discrimination for unsupervised domain adaptation

Abstract

Similar content being viewed by others

A survey on semi-supervised learning

A survey of transfer learning

Learning from imbalanced data: open challenges and future directions

Introduction

Related work

Methods

Optimization objective

Interdomain discrimination

Interdomain intraclass distance discrimination

Interdomain interclass distance discrimination

Intradomain discrimination

Intradomain intraclass discrimination

Intradomain interclass discrimination

First-order difference discrimination

Definition

Discussion

Optimization algorithm

Solving problem (1) given the pseudolabels

Improving the pseudolabels given the solution in (1)

Computational complexity

Storage complexity

Experiments

Datasets and cross-domain tasks

Experimental settings

Conventional UDA methods

Deep learning-based UDA methods

Main results

Ablation study

Effect of the TSRP-based pseudolabel generation method on performance

Effect of first-order discrimination on performance

Effect of zeroth-order discrimination on performance

Effects of hyperparameters on performance

Data visualization

Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation