Introduction

It is known that machine learning benefits from large-scale manually labeled data. However, manual labeling is often time-consuming and labor-intensive, and scenarios with insufficient labeled data are difficult to handle. A natural solution is to transfer knowledge from labeled data to unlabeled data. However, a mismatch between the data distributions may lead to catastrophic results. To address the mismatched-distribution problem, much effort has been devoted to unsupervised domain adaptation (UDA) [1,2,3,4], which transfers empirical knowledge from a label-rich source domain to an unlabeled target domain with a different distribution.

Fig. 1: Principle of the proposed ZFOD. a Zeroth-order discrimination of ZFOD. b Discrimination shift problem of the zeroth-order discrimination. c First-order difference discrimination for remedying the discrimination shift problem, where the length of the dotted lines represents the interclass distance within the domains

The core issue in UDA is to reduce the distribution gap between the source domain and the target domain [5,6,7,8,9]. A common approach is to find a shared subspace in which the source and target data distributions are similar [10,11,12,13]. To learn such a subspace, a metric is first needed to evaluate the distribution difference. Common metrics include A-distance [14], KL divergence [15], and maximum mean discrepancy (MMD) [16]. An optimization objective defined on the metric, with continuous constraints, is then used to align the source and target domains. For example, [17] reduced the mean deviation between the domains by learning a linear projection matrix with MMD.

Recently, many methods have aimed to further improve the discrimination of the source and target data in addition to simply aligning the distributions, since the latter may weaken data discrimination [12, 18]. The optimization objectives of discriminative learning methods can be roughly divided into four types: reducing the intraclass distances between domains [19, 20], enlarging the interclass distances between domains [21], reducing the intraclass distances within domains [22, 23], and enlarging the interclass distances within domains [24, 25]. For example, [19] aligned the conditional distributions of the source and target domains by reducing the mean deviation between a class in the source domain and the same class in the target domain, which can be regarded as a method of reducing the intraclass distance between domains. [21] proposed a contrastive domain discrepancy to reduce the intraclass distance and enlarge the interclass distance between domains. [25] took the maximum distance of the data pairs in the same class and the minimum distance of the data pairs in different classes as regularization terms to align the source and target domains, which can be regarded as a mixture of reducing the intraclass distance between and within domains as well as enlarging the interclass distance within domains. Each type of objective improves the discrimination of the data representations to some extent; however, to our knowledge, none of the existing methods consider all four objectives together.

Moreover, if multiple types of discriminative optimization objectives are used together, their effects on the data representations may differ dramatically, particularly for tasks with large domain shifts. Figures 1a and 1b illustrate this problem: Fig. 1a demonstrates the principles of multiple types of discriminative optimization objectives, and Fig. 1b shows the new data distribution generated by the objectives. From Fig. 1b, we can see that the discrimination of the data distributions is significantly improved compared to that in Fig. 1a; however, the discrimination of the source data differs from that of the target data. We call this inconsistent domain alignment the discrimination shift, which ultimately leads to unsatisfactory transfer performance. To our knowledge, this problem has not been sufficiently explored.

In this paper, we propose a novel UDA algorithm, named zeroth- and first-order difference discrimination (ZFOD), to address the above problems. It jointly optimizes all four discriminative optimization objectives, which are implemented as follows: (i) The minimization of the intraclass distance between domains is implemented by minimizing the distance between the means of the same class across domains. (ii) The maximization of the interclass distance between domains is implemented by maximizing the distance between the mean of a class in the source domain and the mean of a different class in the target domain. (iii) The minimization of the intraclass distance within domains is implemented by minimizing the distance between every two samples in the same class for both the source and target domains. (iv) The maximization of the interclass distance within domains is implemented by maximizing the distance between every two samples from different classes of either the source domain or the target domain.

To align the components of the ZFOD objective for first-order difference discrimination of data across the source and target domains, motivated by [26], we propose a novel first-order difference constraint. As illustrated in Fig. 1c, when ZFOD maximizes the interclass distance within domains, the first-order difference constraint aims to constrain the distances in the source and target domains to be the same.

Notably, optimizing the objective of ZFOD requires pseudolabels for the target domain. However, the accumulation of pseudolabel errors during the iterative optimization process degrades the performance significantly. To alleviate this difficulty, we use a simple and effective method, named mining the target domain intraclass similarity to remedy the pseudolabels (TSRP) [27], to improve the pseudolabel accuracy. The novelty and contributions of this paper are summarized as follows:

  • We propose jointly optimizing all four discriminative optimization objectives, which is the zeroth-order discrimination of ZFOD.

  • We propose first-order difference discrimination to align the objective components. The optimization of the discrimination term in each iteration is formulated as a generalized eigenvalue decomposition problem, which has a simple closed-form solution.

  • We conducted an extensive comparison with nine representative conventional methods [10, 19, 20, 24, 25, 28,29,30,31] and seven prominent deep learning-based methods [32,33,34,35,36,37,38,39] on four benchmark datasets: Office+Caltech10 [29], Office-31 [40], ImageCLEF-DA [41], and Office-Home [42]. The experimental results demonstrate that the proposed method is competitive with not only the conventional comparison methods but also the deep learning-based comparison methods.

The remainder of this paper is organized as follows: In Sect. “Related work”, we review some related works. In Sect.  “Methods”, we present the proposed ZFOD. The experimental results are reported in Sect. “Experiments”. Finally, we conclude this paper in Sect. “Conclusion”.

Related work

In this section, we first review the methods of learning domain-invariant features and then review the methods that focus on improving data discrimination.

Many domain-invariant feature learning methods have been proposed recently [3, 43], which can be roughly divided into two categories: instance reweighting adaptation methods and feature adaptation methods [3]. Instance reweighting adaptation methods aim to allocate resampling weights directly by feature distribution matching across different domains in a nonparametric manner [22, 44, 45]. For example, [45] proposed an intuitive weighting-based subspace alignment method by reweighting the source samples, which generates a source subspace that is close to a target subspace. [22] reweighted instances by landmark selection so that the pivot samples of the landmarks can be selected as a knowledge transfer bridge, and the outliers can be filtered out. Feature adaptation methods aim to obtain a domain-invariant feature by aligning the domain data distributions [29, 33, 46,47,48,49,50]. For instance, [29] mapped both domains into the Grassmann manifold and modeled the domain shift by constructing geodesic flows. [46] aligned the distributions of the source and target domains by aligning their second-order statistics. [47] aligned the distributions indirectly by minimizing the difference between the higher-order central moments of the domains.

In recent years, the idea of generating pseudolabels for the target domain has become popular in UDA [19, 23,24,25]. However, inaccurate pseudolabels may yield unsatisfactory performance with error accumulation during the optimization process. Therefore, some methods aim to improve pseudolabel accuracy [23, 25, 51]. For example, [51] used three asymmetric classifiers to improve pseudolabel accuracy, where two of the classifiers were used to select confident pseudolabels, and the third learned a discriminative data representation for the target domain. [23] proposed generating accurate pseudolabels by selective pseudolabeling and structured prediction. [52] proposed multistage adaptive label filtering to increase correctly labeled target samples.

With the increasing usage of pseudolabels, an increasing number of methods are considering how to improve the discrimination of a data representation while maintaining its domain-invariant property. Here, we list some representative methods that transform the unsupervised target domain into a supervised target domain with pseudolabels. [24] minimized the distance of each pair of samples in a class and maximized the distance of any two samples that belong to different classes for the source and target domains. [5] obtained the pseudolabels of the target domain by clustering and then aligned the class centers of the source and target domains for a domain-invariant subspace. [12] proposed a supervised discriminative MMD with pseudolabels to mitigate the degradation of feature discriminability incurred by MMD. [10] developed an ensemble model by a clustering-promoting technology and obtained the final decision of unlabeled target data via majority voting. [23] utilized supervised locality-preserving projection to reduce the distances between class samples across the source and target domains. [53] proposed cross-domain contrastive learning to make samples within the same category close to each other, while samples from different classes lie far apart, regardless of which domain they come from.

The methods mentioned above only consider part of the discrimination. In this paper, we divide the UDA methods for improving data discrimination into four categories. With this observation, we propose a zeroth-order discrimination objective to further improve the discrimination of source and target data and keep the data distributions of the two domains aligned. Inspired by [26], which constrained the distance of any two samples belonging to different classes to a constant, we propose first-order difference discrimination for UDA. Unlike [26], the proposed first-order difference discrimination constrains the interclass distances of the source domain to be as equal as possible to that of the target domain, thereby mitigating the discriminative differences between the two domains.

Methods

In this section, we first present the proposed ZFOD framework in Sect. “Optimization objective”, and then describe its components in detail, including the interdomain discrimination, intradomain discrimination, first-order difference discrimination, regularization of the projection matrix, and pseudolabel generation algorithm, from Sect. “Interdomain discrimination” to Sect. “Improving the pseudolabels given the solution in (1)”.

Optimization objective

Assume the source domain contains \(n_s\) labeled data points \(\{({\textbf{x}}^{i}_s,y^i_s)\}_{i=1}^{n_s}\) and the target domain contains \(n_t\) unlabeled data points \(\{{\textbf{x}}^{j}_t\}_{j=1}^{n_t}\), where \({\textbf{x}}^i_s, {\textbf{x}}^{j}_t\in {\mathbb {R}}^{m}\) and \(y^i_s\in \{1,2,\ldots ,C\}\), with m denoting the feature dimension and C the number of classes. We denote \({\textbf{X}}_s = [{\textbf{x}}^{1}_s,\ldots , {\textbf{x}}^{n_s}_s]\) and \({\textbf{X}}_t = [{\textbf{x}}^{1}_t,\ldots , {\textbf{x}}^{n_t}_t]\). The whole data matrix is \({\textbf{X}} = [{\textbf{X}}_s,{\textbf{X}}_t]\in {\mathbb {R}}^{m\times n}\) with \(n=n_s+n_t\). In this paper, we generate and optimize pseudolabels for the target domain. We denote the pseudolabel of the target data point \({\textbf{x}}^j_t\) as \({\hat{y}}^j_t\). Our goal is to find a common feature subspace, defined by a projection matrix \({\textbf{P}}\in {\mathbb {R}}^{m\times d}\), such that the new feature representations of the two domains in the subspace, i.e., \({\textbf{z}}_{s}^i={\textbf{P}}^{\top }{\textbf{x}}_{s}^i \) and \({\textbf{z}}_{t}^j={\textbf{P}}^{\top }{\textbf{x}}_{t}^j\), can be effectively aligned, where d is the feature dimension of the subspace.
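To make the notation concrete, the following minimal sketch (NumPy, with illustrative shapes and random data rather than real features) sets up the quantities defined above; the variable names Xs, Xt, ys, yt_hat, X, and P are conventions reused in the later sketches, not identifiers from the released code.

```python
# Minimal notation sketch with synthetic data; shapes and counts are illustrative.
import numpy as np

m, d, C = 2048, 100, 10            # input feature dim, subspace dim, #classes
n_s, n_t = 500, 400                # source / target sample counts

rng = np.random.default_rng(0)
Xs = rng.standard_normal((m, n_s))          # source data, columns are samples
Xt = rng.standard_normal((m, n_t))          # target data
ys = rng.integers(1, C + 1, size=n_s)       # source labels in {1, ..., C}
yt_hat = rng.integers(1, C + 1, size=n_t)   # target pseudolabels (to be refined)

X = np.hstack([Xs, Xt])                     # X in R^{m x n}, n = n_s + n_t
P = rng.standard_normal((m, d))             # projection matrix (to be learned)
Z = P.T @ X                                 # subspace representations z = P^T x
```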

The core idea of our ZFOD is (i) the zeroth-order discrimination, which consists of four discriminative optimization objectives and can be roughly divided into two categories: the interdomain discrimination \({\mathcal {L}}_{\textrm{interD}}\) and intradomain discrimination \({\mathcal {L}}_{\textrm{intraD}}\) for improving the data discrimination, and (ii) the first-order difference discrimination \({\mathcal {L}}_{\textrm{FD}}\) for aligning the interclass discrimination of the source and target domains. ZFOD is formulated as the following optimization problem:

$$\begin{aligned} \begin{aligned}&\min _{{\textbf{P}}}{\mathcal {L}}_{\textrm{interD}}+\alpha {\mathcal {L}}_{\textrm{intraD}}+\beta {\mathcal {L}}_{\textrm{FD}}+ \gamma \left\| {\textbf{P}}\right\| _{F}^{2} \end{aligned} \end{aligned}$$
(1)

where \(\left\| \cdot \right\| _{F}\) denotes the Frobenius norm, and \(\left\| {\textbf{P}}\right\| _{F}^{2}\) is a regularization term on \({\textbf{P}}\) to avoid overfitting; \(\alpha \), \(\beta \), and \(\gamma \) are three tradeoff parameters; and \({\mathcal {L}}_{\textrm{interD}}\), \({\mathcal {L}}_{\textrm{intraD}}\), and \({\mathcal {L}}_{\textrm{FD}}\) are functions of the variables \({\textbf{P}} \) and \(\{{\hat{y}}^j_t\}_{j=1}^{n_t}\). For simplicity, we fix \(\alpha =1\), which was found to yield good performance in an empirical investigation on several benchmark datasets.

Interdomain discrimination

The interdomain discrimination objective \({\mathcal {L}}_{\textrm{interD}}\) aligns the distributions of the source samples and target samples, so that the discrimination of the source samples can be transferred to the unlabeled target samples. Given the pseudolabels for the target domain, the interdomain discrimination can be further divided into two parts: interdomain intraclass distance discrimination and interdomain interclass distance discrimination, which are described respectively as follows:

Interdomain intraclass distance discrimination

Interdomain intraclass distance discrimination aligns the data distribution of each single class across the two domains as closely as possible in the subspace, which can be considered a set of C independent UDA problems. We evaluate the interdomain divergence of a class across domains by MMD [16]. The alignment over all classes is formulated as a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}&=\sum _{k=0}^{C}\left\| \frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{k}} {\textbf{z}}_{t}^j\right\| ^{2}\\&=\sum _{k=0}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}_{k} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned}\nonumber \\ \end{aligned}$$
(2)

where \({\textbf{X}}_{s}^{k}\) denotes the source samples of class k and \({\textbf{X}}_{t}^{k}\) denotes the target samples whose pseudolabels are set to k, and \(n_{s}^{k}\) and \(n_{t}^{k}\) are the numbers of source and target samples of class k, respectively; in particular, class \(k=0\) denotes the whole source or target domain; \({\textbf{M}}_{k}\) is the intraclass matrix of class k across domains, with its element \(\left( {\textbf{M}}_{k}\right) _{i j}\) defined as:

$$\begin{aligned} \left( {\textbf{M}}_{k}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{t}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(3)

and \({\textbf{M}}^{\textrm{intraC}} = \sum _{k=0}^C{\textbf{M}}_{k}\). Note that \({\textbf{M}}_{k}\) is also known as the class-conditional MMD matrix for class k.
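As an illustration of Eqs. (2)–(3), the sketch below (not the authors' released code) builds each class-conditional MMD matrix \({\textbf{M}}_{k}\) as the outer product of a signed indicator vector, which reproduces the four nonzero cases of Eq. (3), and sums them into \({\textbf{M}}^{\textrm{intraC}}\); ys and yt_hat follow the earlier notation sketch.

```python
# Class-conditional MMD matrices M_k of Eq. (3) as outer products e_k e_k^T.
import numpy as np

def conditional_mmd_matrix(ys, yt_hat, k):
    """M_k for class k; k = 0 means the whole source/target domain."""
    n_s, n_t = len(ys), len(yt_hat)
    src_mask = np.ones(n_s, bool) if k == 0 else (ys == k)
    tgt_mask = np.ones(n_t, bool) if k == 0 else (yt_hat == k)
    e = np.zeros(n_s + n_t)
    if src_mask.any():
        e[:n_s][src_mask] = 1.0 / src_mask.sum()      #  1 / n_s^k
    if tgt_mask.any():
        e[n_s:][tgt_mask] = -1.0 / tgt_mask.sum()     # -1 / n_t^k
    return np.outer(e, e)                             # n x n matrix

C = int(ys.max())
M_intraC = sum(conditional_mmd_matrix(ys, yt_hat, k) for k in range(0, C + 1))
```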

Interdomain interclass distance discrimination

Interdomain interclass distance discrimination improves cross-domain data discrimination such that each single class in a domain is far apart from the other classes in the other domain. The interdomain divergence is evaluated by MMD as well. Unlike the problem in Sect. “Interdomain intraclass distance discrimination”, the interdomain interclass distance discrimination is formulated as a maximization of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}}&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C}\left\| \frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{t}^{q}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{q}} {\textbf{z}}_{t}^{j}\right\| ^{2}\\&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}_{kq} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned}\nonumber \\ \end{aligned}$$
(4)

where \({\textbf{M}}_{kq}\) is the interclass matrix of class k in the source domain and pseudoclass q in the target domain:

$$\begin{aligned} \left( {\textbf{M}}_{kq}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{t}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \\ -\frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \\ -\frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q} \\ 0, &{} \text{ otherwise } \end{array}\right. }. \end{aligned}$$
(5)

and \({\textbf{M}}^{\textrm{interC}} = \sum _{k=1}^{C-1}\sum _{q=k+1}^{C}{\textbf{M}}_{kq}\).

We regard the intraclass and interclass discrimination as equally important. Eventually, the interdomain discrimination is formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{interD}}&={\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}- {\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{M}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \end{aligned}$$
(6)

where \({\textbf{M}}\) represents the interdomain discrimination matrix:

$$\begin{aligned} {\textbf{M}}={\textbf{M}}^{\textrm{intraC}}-{\textbf{M}}^{\textrm{interC}}. \end{aligned}$$
(7)
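Analogously, the sketch below illustrates the interclass matrix \({\textbf{M}}_{kq}\) of Eq. (5) and the interdomain discrimination matrix \({\textbf{M}}\) of Eq. (7); it is again an outer-product construction for illustration only, reusing ys, yt_hat, C, and M_intraC from the previous sketches.

```python
# Interclass MMD matrices M_kq of Eq. (5) and the combined matrix M of Eq. (7).
import numpy as np

def interclass_mmd_matrix(ys, yt_hat, k, q):
    """M_kq: source class k against target pseudoclass q."""
    n_s, n_t = len(ys), len(yt_hat)
    e = np.zeros(n_s + n_t)
    src_mask, tgt_mask = (ys == k), (yt_hat == q)
    if src_mask.any():
        e[:n_s][src_mask] = 1.0 / src_mask.sum()      #  1 / n_s^k
    if tgt_mask.any():
        e[n_s:][tgt_mask] = -1.0 / tgt_mask.sum()     # -1 / n_t^q
    return np.outer(e, e)

M_interC = sum(interclass_mmd_matrix(ys, yt_hat, k, q)
               for k in range(1, C) for q in range(k + 1, C + 1))
M = M_intraC - M_interC                               # Eq. (7)
```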

Intradomain discrimination

The intradomain discrimination objective \({\mathcal {L}}_{\textrm{intraD}}\) improves the data discrimination at each domain, which can be divided into two parts: intradomain intraclass distance discrimination and intradomain interclass distance discrimination.

Intradomain intraclass discrimination

Intradomain intraclass discrimination reduces the intraclass distance in each domain. Here we define the intraclass distance as the average distance of pairwise samples in a class, which results in a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{intraC}}_{\textrm{intraD}}&=\sum _{k=1}^{C} \frac{n_{s}}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i, {\textbf{x}}_{s}^j\in {\textbf{X}}_s^k}\left\| {\textbf{z}}_{s}^i-{\textbf{z}}_{s}^j\right\| ^{2}\\&\quad +\sum _{k=1}^{C} \frac{n_{t}}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^i, {\textbf{x}}_{t}^j\in {\textbf{X}}_t^k}\left\| {\textbf{z}}_{t}^i-{\textbf{z}}_{t}^j\right\| ^{2} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{s} {\textbf{W}}_{s}^{\textrm{intraC}} {\textbf{X}}_{s}^{\top } {\textbf{P}}\right) +{\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{t} {\textbf{W}}_{t}^{\textrm{intraC}} {\textbf{X}}_{t}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \end{aligned}$$
(8)

where \({\textbf{W}}_{s}^{\textrm{intraC}}\) and \({\textbf{W}}_{t}^{\textrm{intraC}}\) are the intraclass matrices of the source domain and target domain respectively:

$$\begin{aligned} \left( {\textbf{W}}_{s}^{\textrm{intraC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{s}, &{} \text{ if } i=j \\ -\frac{n_{s}}{n_{s}^{k}}, &{} \text{ if } i \ne j \\ 0, &{} \text{ otherwise }\end{array}\right. },\quad \forall {\textbf{x}}_s^i, {\textbf{x}}_s^j\in {\textbf{X}}_s^k \nonumber \\ \end{aligned}$$
(9)
$$\begin{aligned} \left( {\textbf{W}}_{t}^{\textrm{intraC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{t}, &{} \text{ if } i=j \\ -\frac{n_{t}}{n_{t}^{k}}, &{} \text{ if } i \ne j \\ 0, &{} \text{ otherwise } \end{array}\right. },\quad \forall {\textbf{x}}_t^i, {\textbf{x}}_t^j\in {\textbf{X}}_t^k \nonumber \\ \end{aligned}$$
(10)

and \({\textbf{W}}^{\textrm{intraC}}={\text {diag}}\left( {\textbf{W}}_{s}^{\textrm{intraC}},{\textbf{W}}_{t}^{\textrm{intraC}}\right) \).
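The intraclass matrices of Eqs. (9)–(10) can be assembled as in the following sketch (an illustration, not the released implementation); `labels` stands for the source labels ys or the target pseudolabels yt_hat from the earlier sketches.

```python
# Intradomain intraclass matrices of Eqs. (9)-(10) and their block-diagonal sum.
import numpy as np

def intra_class_matrix(labels):
    n = len(labels)
    W = np.zeros((n, n))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        W[np.ix_(idx, idx)] = -n / len(idx)   # -n / n^k for same-class pairs
    np.fill_diagonal(W, n)                    #  n on the diagonal
    return W

Ws_intraC = intra_class_matrix(ys)
Wt_intraC = intra_class_matrix(yt_hat)
# Block-diagonal W^intraC over the concatenated data X = [Xs, Xt]
W_intraC = np.block([[Ws_intraC, np.zeros((len(ys), len(yt_hat)))],
                     [np.zeros((len(yt_hat), len(ys))), Wt_intraC]])
```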

Intradomain interclass discrimination

Intradomain interclass discrimination enlarges the interclass distance in each domain. Here, we define the interclass distance as the average of the distances of any pair of samples that belong to different classes in a domain, which results in a maximization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^{\textrm{interC}}_{\textrm{intraD}}&=\sum _{y_{s}^i \ne y_{s}^j}\left\| {\textbf{z}}_{s}^i-{\textbf{z}}_{s}^ j\right\| ^{2}+\sum _{{\hat{y}}_{t}^i \ne {\hat{y}}_{t}^j}\left\| {\textbf{z}}_{t}^i-{\textbf{z}}_{t}^j\right\| ^{2} \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{s} {\textbf{W}}_{s}^{\textrm{interC}} {\textbf{X}}_{s}^{\top } {\textbf{P}}\right) +{\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}}_{t} {\textbf{W}}_{t}^{\textrm{interC}} {\textbf{X}}_{t}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$
(11)

where \({\textbf{W}}_{s}^{\textrm{interC}}\) and \({\textbf{W}}_{t}^{\textrm{interC}}\) are the interclass matrices of the source domain and target domain, respectively:

$$\begin{aligned} \left( {\textbf{W}}_{s}^{\textrm{interC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{s}-n_{s}^{k}, &{} \text{ if } i=j \text { and } y_{s}^i=k \\ -1, &{} \text{ if } i \ne j \text { and } y_{s}^i \ne y_{s}^j \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(12)
$$\begin{aligned} \left( {\textbf{W}}_{t}^{\textrm{interC}}\right) _{i j}= {\left\{ \begin{array}{ll}n_{t}-n_{t}^{k}, &{} \text{ if } i=j \text { and } {\hat{y}}_{t}^i=k \\ -1, &{} \text{ if } i \ne j \text { and } {\hat{y}}_{t}^i \ne {\hat{y}}_{t}^j \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(13)

and \({\textbf{W}}^{\textrm{interC}}={\text {diag}}\left( {\textbf{W}}_{s}^{\textrm{interC}},{\textbf{W}}_{t}^{\textrm{interC}}\right) \).

Finally, we formulate the intradomain discrimination from (8) and (11) as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{intraD}}&={\mathcal {L}}^{\textrm{intraC}}_{\textrm{intraD}}-\rho {\mathcal {L}}^{\textrm{interC}}_{\textrm{intraD}}\\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{intraC}} {\textbf{X}}^{\top } {\textbf{P}}\right) -\rho {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}}^{\textrm{interC}} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{W}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$
(14)

where \(\rho \) is a hyperparameter for balancing the two terms, and \({\textbf{W}}\) is the intradomain discrimination matrix:

$$\begin{aligned} {\textbf{W}}={\textbf{W}}^{\textrm{intraC}}-\rho {\textbf{W}}^{\textrm{interC}} \end{aligned}$$
(15)
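A corresponding sketch of the interclass matrices of Eqs. (12)–(13) and the combined intradomain matrix \({\textbf{W}}\) of Eq. (15) is given below; W_intraC comes from the previous sketch, and rho is the tradeoff of Eq. (14). It is illustrative code, not the released implementation.

```python
# Intradomain interclass matrices of Eqs. (12)-(13) and W of Eq. (15).
import numpy as np

def inter_class_matrix(labels):
    n = len(labels)
    same = (labels[:, None] == labels[None, :])   # True where y_i == y_j
    W = np.where(same, 0.0, -1.0)                 # -1 for different-class pairs
    counts = np.bincount(labels)[labels]          # n^k of each sample's class
    np.fill_diagonal(W, n - counts)               # n - n^k on the diagonal
    return W

rho = 0.1
Ws_interC, Wt_interC = inter_class_matrix(ys), inter_class_matrix(yt_hat)
W_interC = np.block([[Ws_interC, np.zeros((len(ys), len(yt_hat)))],
                     [np.zeros((len(yt_hat), len(ys))), Wt_interC]])
W = W_intraC - rho * W_interC                     # Eq. (15)
```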

First-order difference discrimination

Definition

The first-order difference discrimination objective \({\mathcal {L}}_{\textrm{FD}}\) constrains the distance between any two classes of one domain to be similar to the distance between the same pair of classes of the other domain, which is formulated as a minimization problem of the following objective:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\textrm{FD}}&= \sum _{k=1}^{C-1}\sum _{q=k+1}^{C}\left\| {\frac{1}{n_{s}^{k}} \sum _{{\textbf{x}}_{s}^i \in {\textbf{X}}_{s}^{k}} {\textbf{z}}_{s}^{i}-\frac{1}{n_{s}^{q}} \sum _{{\textbf{x}}_{s}^j \in {\textbf{X}}_{s}^{q}} {\textbf{z}}_{s}^j}\right. \\&\quad \left. {-\frac{1}{n_{t}^{k}} \sum _{{\textbf{x}}_{t}^i \in {\textbf{X}}_{t}^{k}} {\textbf{z}}_{t}^i+\frac{1}{n_{t}^{q}} \sum _{{\textbf{x}}_{t}^j \in {\textbf{X}}_{t}^{q}} {\textbf{z}}_{t}^j}\right\| ^{2}\\&=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{S}}_{kq} {\textbf{X}}^{\top } {\textbf{P}}\right) \\&={\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} {\textbf{S}} {\textbf{X}}^{\top } {\textbf{P}}\right) \end{aligned} \nonumber \\ \end{aligned}$$
(16)

where

$$\begin{aligned} \left( {\textbf{S}}_{kq}\right) _{i j}= {\left\{ \begin{array}{ll}\frac{1}{n_{s}^{k} n_{s}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k} \\ \frac{1}{n_{s}^{q} n_{s}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q} \\ \frac{1}{n_{t}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \\ \frac{1}{n_{t}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \\ \frac{1}{n_{s}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ \frac{1}{n_{s}^{q} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{k} n_{s}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}\\ -\frac{1}{n_{s}^{k} n_{t}^{k}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}\\ -\frac{1}{n_{s}^{q} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{s}^{q}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ -\frac{1}{n_{t}^{k} n_{t}^{q}}, &{} \forall {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{k}, {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{q} \ \text {or} \ \forall {\textbf{x}}^{j} \in {\textbf{X}}_{t}^{k}, {\textbf{x}}^{i} \in {\textbf{X}}_{t}^{q}\\ 0, &{} \text{ otherwise } \end{array}\right. }. \nonumber \\ \end{aligned}$$
(17)

and \({\textbf{S}}\) represents the first-order difference discrimination matrix:

$$\begin{aligned} {\textbf{S}}=\sum _{k=1}^{C-1}\sum _{q=k+1}^{C}{\textbf{S}}_{kq} \end{aligned}$$
(18)
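Since each \({\textbf{S}}_{kq}\) in Eq. (17) is the outer product of a signed indicator vector over the four class/domain groups involved, it can be sketched compactly as follows (an illustration consistent with Eqs. (17)–(18), not the released code); ys, yt_hat, and C follow the earlier sketches.

```python
# First-order difference matrices S_kq of Eq. (17) and their sum S of Eq. (18).
import numpy as np

def first_order_matrix(ys, yt_hat, k, q):
    n_s = len(ys)
    e = np.zeros(n_s + len(yt_hat))
    # +1/n for (source, k) and (target, q); -1/n for (source, q) and (target, k)
    for mask, offset, sign in [((ys == k), 0, +1.0), ((ys == q), 0, -1.0),
                               ((yt_hat == k), n_s, -1.0), ((yt_hat == q), n_s, +1.0)]:
        if mask.any():
            e[offset:offset + len(mask)][mask] = sign / mask.sum()
    return np.outer(e, e)

S = sum(first_order_matrix(ys, yt_hat, k, q)
        for k in range(1, C) for q in range(k + 1, C + 1))   # Eq. (18)
```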

Discussion

The first-order difference discrimination \({\mathcal {L}}_{\textrm{FD}}\) is designed to remedy the weakness of the interdomain discrimination \({\mathcal {L}}_{\textrm{interD}}\).

Specifically, when the data distributions are nonuniform, the components of \({\mathcal {L}}^{\textrm{intraC}}_{\textrm{interD}}\) with large intraclass distances and the components of \({\mathcal {L}}^{\textrm{interC}}_{\textrm{interD}}\) with small interclass distances contribute most of the reduction of the objective value \(\min _{{\textbf{P}}} {\mathcal {L}}_{\textrm{interD}}\) during the optimization process, which biases the solution \({\textbf{P}}\) toward these components and against the others. As a result, the classes that do not benefit much from \(\min _{{\textbf{P}}} {\mathcal {L}}_{\textrm{interD}}\) tend not to be well aligned across domains. In other words, the knowledge of those classes in the source domain may not transfer well to the target domain.

If we calculate the components of \({\mathcal {L}}_{\textrm{FD}}\), we find that the classes that are not well aligned tend to make the values of the corresponding components large, while the classes that are well aligned yield small values. Therefore, we propose \({\mathcal {L}}_{\textrm{FD}}\) as a supplement to \({\mathcal {L}}_{\textrm{interD}}\).

Optimization algorithm

The optimization of ZFOD alternately conducts the following steps: (i) solving problem (1) given the pseudolabels and (ii) improving the pseudolabels with the solution of (1). See the following subsections for the two steps and Algorithm 1 for a summary of ZFOD.

Algorithm 1: ZFOD.
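A high-level sketch of this alternation is given below; `initial_pseudolabels`, `solve_projection`, and `tsrp_refine` are placeholder names for the routines described in the following subsections, not functions from the released code.

```python
def zfod(Xs, ys, Xt, beta=1.5, gamma=1.0, rho=0.1, d=100, n_iter=10):
    # Placeholder pipeline illustrating the alternation of steps (i) and (ii).
    yt_hat = initial_pseudolabels(Xs, ys, Xt)          # e.g., a classifier on the source
    for _ in range(n_iter):
        P = solve_projection(Xs, ys, Xt, yt_hat,       # step (i): solve problem (1)
                             beta=beta, gamma=gamma, rho=rho, d=d)
        Zs, Zt = P.T @ Xs, P.T @ Xt
        yt_hat = tsrp_refine(Zs, ys, Zt)               # step (ii): TSRP [27]
    return P, yt_hat
```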

Solving problem (1) given the pseudolabels

Substituting (6), (14), and (16) into (1) results in the following optimization problem:

$$\begin{aligned} \begin{array}{cl} \min \limits _{{\textbf{P}}} &{} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top } {\textbf{P}}\right) +\gamma \left\| {\textbf{P}}\right\| _F^2 \\ \text{ s.t. } &{} {\textbf{P}}^{\top } \textbf{X H} {\textbf{X}}^{\top } {\textbf{P}}={\textbf{I}}_{d} \end{array} \end{aligned}$$
(19)

where the constraint maximizes the embedded data variance, as [24] and [19] did, \({\textbf{I}}_{d}\) is an identity matrix of dimension d, and \({\textbf{H}}={\textbf{I}}_{(n_s+n_t)}-(1/(n_s+n_t)){\textbf{1}}_{(n_s+n_t)\times (n_s+n_t)}\).

The Lagrangian function of problem (19) is:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}({\textbf{P}}, \Theta )=&{\text {Tr}}\left( {\textbf{P}}^{\top }\left( {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top }+\gamma {\textbf{I}}_{m}\right) {\textbf{P}}\right) \\&+{\text {Tr}}\left( \left( {\textbf{I}}_{d}-{\textbf{P}}^{\top } {\textbf{X}} {\textbf{H}} {\textbf{X}}^{\top } {\textbf{P}}\right) \Theta \right) \end{aligned} \end{aligned}$$
(20)

where \(\Theta ={\text {diag}}(\theta _1,\theta _2,\ldots ,\theta _d) \in {\mathbb {R}}^{d\times d}\) is a diagonal matrix of Lagrange multipliers. Setting \(\partial {\mathcal {L}}({\textbf{P}}, \Theta )/\partial {\textbf{P}} = 0\) yields the following optimal solution of (19):

$$\begin{aligned} {\textbf{P}}^{\star }=\left( {\textbf{X}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{X}}^{\top }+\gamma {\textbf{I}}_{m}\right) ^{-1}\textbf{X H} {\textbf{X}}^{\top } {\textbf{P}}\Theta \end{aligned}$$
(21)

which is a generalized eigenvalue decomposition problem. In practice, we select the generalized eigenvectors on the right side of (21) corresponding to the d smallest eigenvalues as the final \({\textbf{P}}^{\star }\).
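For illustration, the following sketch solves this generalized eigenproblem, \({\textbf{A}}{\textbf{p}}=\theta {\textbf{B}}{\textbf{p}}\) with \({\textbf{A}}={\textbf{X}}({\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}){\textbf{X}}^{\top }+\gamma {\textbf{I}}\) and \({\textbf{B}}={\textbf{X}}{\textbf{H}}{\textbf{X}}^{\top }\), with SciPy and keeps the eigenvectors of the d smallest eigenvalues. It assumes the matrices M, W, and S from the earlier sketches and is an illustrative solver, not the released implementation.

```python
import numpy as np
import scipy.linalg

def solve_projection_from_matrices(X, M, W, S, beta=1.5, gamma=1.0, d=100):
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix of Eq. (19)
    A = X @ (M + W + beta * S) @ X.T + gamma * np.eye(m)
    B = X @ H @ X.T
    # B is only positive semidefinite in general, so use the general solver
    # and sort by the real parts of the generalized eigenvalues.
    eigvals, eigvecs = scipy.linalg.eig(A, B)
    order = np.argsort(eigvals.real)
    return np.real(eigvecs[:, order[:d]])               # P in R^{m x d}
```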

Note that problem (19) can be easily generalized into the following kernel form through a kernel mapping \(\phi : {\textbf{X}} \rightarrow \phi ({\textbf{X}})\):

$$\begin{aligned} \begin{array}{cl} \min \limits _{{\textbf{P}}} &{} {\text {Tr}}\left( {\textbf{P}}^{\top } {\textbf{K}} \left( {\textbf{M}}+{\textbf{W}}+\beta {\textbf{S}}\right) {\textbf{K}}^{\top } {\textbf{P}}\right) +\gamma \left\| {\textbf{P}}\right\| _F^2 \\ \text{ s.t. } &{} {\textbf{P}}^{\top } \textbf{K H} {\textbf{K}}^{\top } {\textbf{P}}={\textbf{I}}_{d} \end{array} \end{aligned}$$
(22)

where \({\textbf{K}}=\phi ({\textbf{X}})^{\top } \phi ({\textbf{X}})\). It has a similar solution to (21).
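As an example of the kernel form, the sketch below computes a Gram matrix with an RBF kernel (one common choice; Eq. (22) does not fix the kernel type) and reuses the solver sketched above with \({\textbf{K}}\) in place of \({\textbf{X}}\).

```python
# Illustrative RBF Gram matrix for the kernelized problem (22).
import numpy as np

def rbf_gram(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) over the columns of X."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

K = rbf_gram(X)   # n x n, plays the role of X in Eq. (22)
# e.g., P = solve_projection_from_matrices(K, M, W, S, d=100) with the solver
# above, now yielding P in R^{n x d} and embeddings Z = P^T K.
```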

Improving the pseudolabels given the solution in (1)

Inaccurate pseudolabels of the target domain result in suboptimal performance. To alleviate this problem, we use our previous work, named TSRP [27], to correct the errors of the pseudolabels. For completeness, we briefly present the TSRP algorithm as follows:

As shown in Fig. 2, TSRP first generates coarse pseudolabels for the target domain by a classifier \(f_{\textrm{init}}(\cdot )\) trained on the source samples \(\{({\textbf{z}}^{i}_s,y^i_s)\}_{i=1}^{n_s}\) and calculates the similarity of the samples in each pseudoclass, as in Fig. 2a. Then, for each pseudoclass, it removes the samples with low confidence, as in Fig. 2a, b, where the confidence is defined as

$$\begin{aligned} S_{i,j}^{k}= \left\{ \begin{array}{ll} \frac{\langle {\textbf{z}}_{t}^{i}, {\textbf{z}}_{t}^{j}\rangle }{\Vert {\textbf{z}}_{t}^{i}\Vert \Vert {\textbf{z}}_{t}^{j}\Vert }, &{} i\ne j, {\hat{y}}_{t}^i = {\hat{y}}_{t}^j = k\\ 0, &{} \text {otherwise} \\ \end{array}\right. , \forall k=1,2,\ldots ,C \nonumber \\ \end{aligned}$$
(23)

Next, as shown in the step from Fig. 2b to c, for each pseudoclass, TSRP connects the samples by spanning trees and then selects the samples of the spanning tree whose root sample has the maximum degree. In this way, TSRP can further eliminate the negative effect of highly confident misclassified samples. Finally, it uses the selected high-confidence target-domain samples together with the source-domain samples to train a final classifier \(f_{\textrm{final}}(\cdot )\), which is used to generate the refined pseudolabels of the remaining low-confidence samples.
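For illustration, the sketch below computes the confidence scores of Eq. (23) and a simple version of the deleting step of Fig. 2a, b; the spanning-tree selection and the final classifier are omitted, and the threshold tau is an illustrative parameter rather than a value prescribed by [27].

```python
# Partial TSRP sketch: Eq. (23) cosine similarities and a confidence filter.
import numpy as np

def tsrp_confident_mask(Zt, yt_hat, tau=0.5):
    """Keep target samples whose mean intraclass cosine similarity exceeds tau."""
    Zn = Zt / (np.linalg.norm(Zt, axis=0, keepdims=True) + 1e-12)
    keep = np.zeros(Zt.shape[1], dtype=bool)
    for k in np.unique(yt_hat):
        idx = np.where(yt_hat == k)[0]
        if len(idx) < 2:
            continue
        S = Zn[:, idx].T @ Zn[:, idx]          # pairwise cosine similarities, Eq. (23)
        np.fill_diagonal(S, 0.0)               # i != j in Eq. (23)
        conf = S.sum(axis=1) / (len(idx) - 1)  # mean similarity per sample
        keep[idx] = conf > tau
    return keep
```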

Fig. 2: TSRP principle [27]. TSRP consists of two steps: (i) deleting, which removes the samples with low pairwise similarity scores, as shown from Fig. (a) to Fig. (b), and (ii) spanning tree, which selects samples with highly confident pseudolabels by spanning trees, as shown from Fig. (b) to Fig. (c)

Computational complexity

Assume ZFOD needs T iterations to converge; then the overall computational complexity of ZFOD is \({\mathcal {O}}(T((C^2+4)(n_s+n_t)^2+dm^2))\), which is derived as follows:

As shown in Algorithm 1, for each iteration, the computational complexity of ZFOD mainly consists of the following five parts:

  • Calculating the interdomain discrimination matrix \({\textbf{M}}\) takes \({\mathcal {O}}(0.5(C^2+C+2)(n_s+n_t)^2)\) time.

  • Calculating the intradomain discrimination matrix \({\textbf{W}}\) takes \({\mathcal {O}}(2(n_s+n_t)^2)\) time.

  • Calculating the first-order difference discrimination matrix \({\textbf{S}}\) takes \({\mathcal {O}}(0.5(C^2-C)(n_s+n_t)^2)\) time.

  • Solving the generalized eigendecomposition problem takes \({\mathcal {O}}(dm^2)\) time.

  • TSRP takes \({\mathcal {O}}((n_s+n_t)^2)\) time.

As the experiments show, because T and the optimal d are usually small, ZFOD can be solved in polynomial time with respect to the number of data samples.

Storage complexity

The overall storage complexity of ZFOD is \({\mathcal {O}}((n_s+n_t)^2)\), which is derived as follows:

  • Storing the source and target data requires \({\mathcal {O}}(n_s+n_t)\) space.

  • Calculating the interdomain discrimination matrix \({\textbf{M}}\) requires \({\mathcal {O}}((n_s+n_t)^2)\) space.

  • Calculating the intradomain discrimination matrix \({\textbf{W}}\) requires \({\mathcal {O}}(n_s^2+n_t^2)\) space.

  • Calculating the first-order difference discrimination matrix \({\textbf{S}}\) requires \({\mathcal {O}}((n_s+n_t)^2)\) space.

Table 1 Computational and storage complexities of the comparison methods

Experiments

In this section, we evaluate the performance of the proposed ZFOD on several popular visual cross-domain benchmarks. The source code of ZFOD is available at https://github.com/02Bigboy/ZFOD.

Datasets and cross-domain tasks

The experiments were performed on four datasets, whose detailed information is as follows:

Office+Caltech10 [29] contains four domains: Amazon, Webcam, DSLR and Caltech-256, which share the same 10 classes. The dataset has 2533 images in total.

Office-31 [40] consists of three domains: Amazon (A), Webcam (W) and DSLR (D). It contains 4110 images with 31 categories in total.

ImageCLEF-DA [41] has three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). It contains 12 classes, each of which has 50 images from three domains.

Office-Home [42] includes 65 object classes from four domains, i.e., Artistic images (A), Clipart (C), Product images (P) and Real-World images (R). There are a total of 15,588 images.

Table 2 Average classification accuracy (%) of the comparison methods on the target domains of the Office+Caltech10 dataset, where A = Amazon, C = Caltech, D = DSLR, and W = Webcam. The highest accuracy on a cross-domain task is marked in bold

For the Office+Caltech10 dataset, we used the 4096-dimensional DECAF-6 feature [54]. For the other datasets, we used the 2048-dimensional ResNet-50 feature [32].

Experimental settings

For the proposed ZFOD, we applied the following parameter setting to all comparison experiments. Specifically, we set the hyperparameters \(\beta =1.5\), \(\gamma =1\), and \(\rho =0.1\). We set the dimension of the subspace d to 100, and limited the number of the optimization iterations to be no larger than 10, i.e., \(N=10\). As the ablation study shows, ZFOD is insensitive to the hyperparameter setting.
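For convenience, the default setting above can be collected in a single configuration; the dictionary below only restates the values given in this paragraph and is not part of the released code.

```python
# Default hyperparameters of ZFOD used in all comparison experiments.
ZFOD_DEFAULTS = dict(
    alpha=1.0,    # weight of the intradomain discrimination (fixed in Eq. (1))
    beta=1.5,     # weight of the first-order difference discrimination
    gamma=1.0,    # weight of the regularization term on P
    rho=0.1,      # intradomain intraclass/interclass tradeoff, Eq. (15)
    d=100,        # subspace dimension
    n_iter=10,    # maximum number of optimization iterations N
)
```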

To evaluate the effectiveness of the proposed algorithm, nine representative conventional methods and seven prominent deep learning-based methods were compared.

Conventional UDA methods

We compare nine conventional UDA methods with ZFOD:

  • Classic UDA methods:

    1. 1-nearest neighbor classifier (1NN) [28].
    2. Geodesic flow kernel for domain adaptation (GFK) [29].
    3. Joint geometrical and statistical alignment (JGSA) [30].
    4. Manifold embedded distribution alignment (MEDA) [20].

  • Method of reducing the intraclass distance between domains:

    1. Joint distribution adaptation (JDA) [19].

  • Method of reducing the intraclass distance between domains and reducing the intraclass distance within domains:

    1. Domain-irrelevant class clustering (DICE) [10].

  • Method of reducing the intraclass distance and enlarging the interclass distances within domains:

    1. Minimum centroid shift (MCS) [31].

  • Methods of reducing the intraclass distance between domains, enlarging the interclass distances within domains, and reducing the intraclass distances within domains:

    1. Discriminative transfer feature and label consistency (DTLC) [25].
    2. Domain invariant and class discriminative (DICD) [24].
    3. Target similarity for pseudolabels (TSRP) [27].

Note that the proposed ZFOD is closely related to DICD: when ZFOD removes the interclass distance discrimination across domains and the first-order difference discrimination, it reduces to DICD.

We list the computational and storage complexities of the proposed methods and some comparison methods in Table 1. The table shows that when the number of classes (C) is small, the complexity of the proposed method is similar to those of the comparison methods.

Deep learning-based UDA methods

They are the deep adaptation networks (DAN) [33], residual transfer networks (RTN) [34], multiadversarial DA (MADA) [35], conditional domain adversarial network (CDAN) [36], joint adaptation networks (JAN) [37], collaborative and adversarial network (iCAN) [38], and maximum classifier discrepancy (MCD) [39]. Note that we also added ResNet-50 [32], which applies the classifier trained on the source domain directly to the target domain without any specially designed UDA algorithm, as a baseline.

Table 3 Average classification accuracy (%) of the comparison methods on the target domains of the Office-31 and ImageCLEF-DA datasets
Table 4 Average classification accuracy (%) of the comparison methods on the target domains of the Office-Home dataset

The classification accuracy on the target domain is used as the evaluation metric. For a fair comparison, the reported results of the comparison methods were either from their original papers or produced by the publicly available codes of the methods.

Main results

Table 2 lists the classification accuracy of the comparison methods on the domain adaptation tasks of the Office+Caltech10 dataset. The table shows that ZFOD yields the highest average accuracy, achieving an absolute improvement of \(0.4\%\) over the best competitor, DTLC. For the results on each domain adaptation task, ZFOD achieves the highest accuracy on 6 out of 12 tasks.

In particular, the average accuracy of ZFOD is \(3.5\%\) higher than that of the closely related comparison method DICD. Looking at the tasks in detail, we find that ZFOD is superior to DICD on all tasks except “D\(\rightarrow \)W”. For example, the accuracy is improved from \(83.4\%\) (DICD) to \(95.5\%\) (ZFOD) on the “A\(\rightarrow \)D” task, and from \(93.6\%\) to \(98.7\%\) on the “C\(\rightarrow \)D” task.

Table 3 shows the classification accuracy of the comparison methods on the Office-31 and ImageCLEF-DA datasets. From the table, we see that ZFOD reaches an average accuracy of \(88.1\%\), which is \(0.2\%\) higher than that of the runner-up method, DTLC, and achieves the best results on 3 out of 12 tasks. It is worth mentioning that ZFOD is superior to DICD on all tasks, with an average accuracy \(3.5\%\) higher than that of DICD. In particular, for the “A\(\rightarrow \)D” task of the Office-31 dataset and the “P\(\rightarrow \)I” task of the ImageCLEF-DA dataset, the accuracy is increased from 81.7\(\%\) and 81.2\(\%\) to 91.0\(\%\) and 91.5\(\%\), respectively. Moreover, ZFOD outperforms the deep learning-based methods; its average accuracy is \(0.8\%\) higher than that of the best deep model, iCAN.

Table 4 lists the classification accuracy of the comparison methods on the Office-Home dataset. From the table, we observe phenomena similar to those on the other datasets. Specifically, ZFOD reaches the highest average accuracy, which is \(0.8\%\) higher than that of the best reference method, EasyTL. It achieves the highest accuracy on 3 out of 12 tasks, while EasyTL wins none of the tasks. Compared to the deep learning-based methods, ZFOD outperforms all of them with an absolute accuracy improvement of at least \(1.6\%\). For example, ZFOD outperforms CDAN on 7 out of 12 tasks; the accuracy is improved from 66.0\(\%\) to 70.8\(\%\) on the “Cl\(\rightarrow \)Pr” task, and from 55.6\(\%\) to 62.5\(\%\) on the “Pr\(\rightarrow \)Ar” task.

Table 5 Properties of the comparison methods

The above phenomena, which demonstrate the advantage of ZFOD, can be explained as follows: ZFOD not only maximizes the data discrimination in four aspects but also aligns the data distributions of the two domains by the so-called first-order difference discrimination, which may result in better domain-invariant and discriminative representations than those of the comparison methods. In contrast, some excellent comparison methods, such as MCS, DTLC, and DICD, only maximize part of the data discrimination. Their interdomain alignment is also limited to the data level and does not consider higher-order data discrimination. See Table 5 for a summary of the differences between ZFOD and some representative conventional methods that also optimize the data discrimination to some extent.

Ablation study

ZFOD contains three novel points: TSRP-based pseudolabel generation, first-order difference discrimination, and zeroth-order discrimination. In this subsection, we conducted an ablation study by removing these novel points from ZFOD one by one. ZFOD without the TSRP-based pseudolabel generation method is denoted as “ZFOD without TSRP”. ZFOD without the first two novel points is denoted as “ZFOD without TSRP and FOD”. If we remove all three novel points, the most closely related method is DICD [24]: if the interclass distance across domains is removed from the zeroth-order discrimination, the remaining three discrimination subitems are of the same types as those in DICD. Therefore, we compared the above ZFOD variants with DICD.

The experiments were conducted on the four UDA datasets. For each dataset, we randomly selected two tasks as representatives. The comparison results are listed in Table 6 and analyzed as follows:

Effect of the TSRP-based pseudolabel generation method on performance

From the comparison between “ZFOD without TSRP” and ZFOD, we see that the performance of “ZFOD without TSRP” is clearly worse than that of ZFOD. This indicates that the TSRP-based pseudolabel generation method improves the accuracy of the pseudolabels and that the correctness of the pseudolabels is very important for ZFOD to learn a domain-invariant and discriminative feature.

Effect of first-order discrimination on performance

From the comparison between “ZFOD without TSRP” and “ZFOD without TSRP and FOD”, we see that the former shows slightly better or at least similar performance compared with the latter on the tasks of Office+Caltech10, Office-31, and ImageCLEF-DA, and significantly outperforms the latter on the tasks of Office-Home. This shows the effectiveness of the first-order discrimination, which further helps align the distributions of the source and target domains.

Effect of zeroth-order discrimination on performance

From the comparison between “ZFOD without TSRP and FOD” and DICD, we find that the zeroth-order discrimination, which has an extra interdomain interclass distance discrimination subitem beyond DICD, outperforms DICD on Office+Caltech10, Office-31, and ImageCLEF-DA. However, it is significantly worse than DICD on Office-Home, which indicates that the advantage of combining all four subitems of the zeroth-order discrimination over a subset of the subitems is not guaranteed, due to the discrimination inconsistency problem analyzed in Sect. “Discussion”.

Comparing the results of “ZFOD without TSRP and FOD” and DICD, we find that there is no guarantee that the former is always better than DICD because of the discrimination inconsistency problem. However, “ZFOD without TSRP” clearly outperforms DICD on all tasks, which indicates that the first-order discrimination overcomes the discrimination inconsistency problem of the zeroth-order discrimination.

Table 6 Ablation study for the components of ZFOD
Fig. 3: Effect of the hyperparameters on four domain adaptation tasks. Different colors represent different domain adaptation tasks

Effects of hyperparameters on performance

ZFOD has four hyperparameters \(\beta \), \(\gamma \), \(\rho \), and d. In the previous sections, we used the same hyperparameter setting in all experiments. To study whether the performance of ZFOD is sensitive to the hyperparameter setting, we conducted a grid search for each parameter, fixing the other parameters to their default values during the search. The experiments were conducted on randomly selected tasks of the four UDA datasets: “C \(\rightarrow \) W (Office+Caltech10)”, “P \(\rightarrow \) I (ImageCLEF-DA)”, “A \(\rightarrow \) W (Office-31)”, and “Ar \(\rightarrow \) Pr (Office-Home)”. The results are shown in Fig. 3 and analyzed as follows:

Fig. 4: Visualization of the data distributions produced by DICD and ZFOD on the “A \(\rightarrow \) D (Office+Caltech10)” task. Different colors represent different categories. The source samples are marked by the symbol “o”. The target samples are marked by “+”

Fig. 5: Visualization of the data distributions produced by DICD and ZFOD on the “W \(\rightarrow \) C (Office+Caltech10)” task

The hyperparameter \(\beta \) weights the first-order discrimination: the larger \(\beta \) is, the more influential the first-order discrimination is in ZFOD. We investigated \(\beta \) in a wide range of [0, 3]. The results in Fig. 3 show that, as \(\beta \) is gradually increased from 0 to 3, the performance of ZFOD improves steadily, especially on the large Office-Home dataset. This phenomenon confirms that the first-order difference discrimination facilitates the alignment of the source and target domains. Because ZFOD achieves relatively stable results when \(\beta \in [0.5,1.5]\), we chose \(\beta =1.5\) as the default value.

The hyperparameter \(\gamma \) balances the discrimination loss functions and the complexity of the projection matrix \({\textbf{P}}\). We studied \(\gamma \) by a grid search over \(\{0.1, 0.5, 1, 5, 10, 50, 100\}\). The figure shows that all accuracy curves tend to first rise and then fall as \(\gamma \) is gradually increased. Eventually, we picked \(\gamma =1\), which tends to yield the best performance on all tasks.

Fig. 6: Cross-domain similarity matrices produced by DICD and ZFOD on the “A \(\rightarrow \) D (Office+Caltech10)” task

Fig. 7: Cross-domain similarity matrices produced by DICD and ZFOD on the “W \(\rightarrow \) C (Office+Caltech10)” task

The hyperparameter d defines the dimension of the subspace. We searched d in a wide range of [10, 200]. The figure shows that as d increases, the accuracy curves first rise gradually and then become stable. Finally, we chose \(d = 100\), which balances classification performance and computational complexity.

The hyperparameter \(\rho \) in (15) balances the intradomain intraclass discrimination and intradomain interclass discrimination. We studied \(\rho \) by a grid search of \(\{0.01, 0.05, 0.1, 0.5, 1, 2, 5\}\). The figure shows that ZFOD performs steadily when \(\rho \in [0.01, 0.5]\), and drops sharply on the task of “Ar \(\rightarrow \) Pr (Office-Home)” when \(\rho >0.5\). Therefore, we set \(\rho =0.1\) as a safe default value.

From the above analysis, we conclude that ZFOD is insensitive to the hyperparameter selection. Although the performance of ZFOD could be further improved beyond the results of the previous subsections by carefully tuning the hyperparameters per task, we did not do so, in order to reflect the real-world application setting of ZFOD.

Data visualization

In this subsection, we visualize the data distributions and pairwise similarity matrices produced by DICD and ZFOD on two randomly selected tasks: “A \(\rightarrow \) D (Office+Caltech10)” and “W \(\rightarrow \) C (Office+Caltech10)”.

Figures 4 and 5 visualize the data distributions. From the two figures, we see that the data distributions produced by ZFOD have larger interclass distances than those produced by DICD.

Figures 6 and 7 show the similarity matrices of all data across domains, where the similarity between samples is measured by the cosine similarity. A similarity matrix across domains contains three parts: a source-domain similarity matrix which is at the upper-left corner of the full similarity matrix, a target-domain similarity matrix which is at the lower-right corner, and two cross-domain similarity matrices which are at the upper right and lower left corners.

Figures 6a and 7a show that the sample similarity of the cross-domain similarity matrices produced by DICD is quite high for some different classes, which indicates that DICD does not focus enough on domain alignment. In contrast, Figs. 6b and 7b show that ZFOD alleviates this problem, which indicates the effectiveness of the first-order discrimination.

Moreover, Figs. 6a and 7a show that the sample similarity produced by DICD in the source- and target-domain similarity matrices is high for the samples in the same class and not discriminative enough for the samples in different classes. In contrast, Figs. 6b and 7b show that ZFOD not only yields high intraclass similarity as DICD does, but also produces lower interclass similarity than DICD, which indicates the advantage of the zeroth-order discrimination.

Conclusion

In this paper, we proposed the zeroth- and first-order difference discrimination algorithm for unsupervised domain adaptation. It contains three novel components: zeroth-order discrimination, first-order difference discrimination, and TSRP-based pseudolabel generation. The zeroth-order discrimination consists of interdomain discrimination and intradomain discrimination, each of which is further divided into interclass discrimination and intraclass discrimination. The novelty of the zeroth-order discrimination is that it covers four important aspects of data discrimination that, to our knowledge, have not been considered together in the literature. Because the interdomain discrimination only maximizes the cross-domain discrimination at the data level without aligning the interclass distances of the source and target domains, the first-order difference discrimination was proposed to overcome this weakness. Because all of the discrimination terms use pseudolabels to define the pseudoclasses of the target domain, it is important to generate highly accurate pseudolabels, and the TSRP-based pseudolabel generation method is therefore applied. Its core idea is to iteratively pick the pseudolabels with high confidence to train a strong classifier, which is then used to correct the remaining pseudolabels or improve their confidence. We conducted an extensive comparison with nine state-of-the-art conventional UDA methods and seven representative deep learning-based UDA methods. The comparison results demonstrate the effectiveness of the proposed method. The ablation studies further confirm the effectiveness of each novel component of ZFOD, e.g., in improving the discrimination of the domain-invariant feature and aligning the source and target domains well.