1 Introduction

Due to the increasing demand for information security and privacy protection, sharing data across domains has become less feasible. In addition, the conventional unsupervised domain adaptation (UDA) setting has recently been questioned regarding the necessity of access to the source domain (Chidlovskii et al., 2016; Lao et al., 2021; Tanwisuth et al., 2021). In this context, model transfer has emerged as a promising alternative (Kim et al., 2021; Li et al., 2020; Liang et al., 2020). This setting is known as Source-Free Domain Adaptation (SFDA), which aims to adapt a pretrained model to a target scenario in an unsupervised manner without access to the source domain training data.

Existing SFDA methods rely on mining auxiliary information to enable the adoption of well-established feature alignment algorithms (Long et al., 2018; Hoffman et al., 2018), following two strategies. The first creates a fake source domain via generative models (Li et al., 2020) or source hypothesis-based target data splitting (Du et al., 2021), and then aligns the pseudo source data and the target data in feature space as in UDA. The second performs self-supervised learning to transfer the source model to the target domain. In practice, techniques based on pseudo-labels (Liang et al., 2020), source prototypes (Tanwisuth et al., 2021), and target geometric information (Tang et al., 2022) are used to guide the self-supervised learning. Essentially, the two strategies perform feature distribution alignment either explicitly (the first) or implicitly (the second). In Fig. 1a, we illustrate the distribution alignment process. The given source model produces a prediction for the source feature distribution marked in orange. Feature distribution alignment encourages the embedded target data (marked in green) to move/cluster toward the correct class cluster of the source feature distribution, so that the frozen classifier in the source model can correctly predict categories for the target data. However, aligning feature distributions for SFDA is challenging in the absence of source domain training data and target domain labels. First, the auxiliary information is error-prone, which incurs a risk of further error propagation. Furthermore, this limitation is easily amplified in existing SFDA methods because they lack error mitigation.

Fig. 1

Paradigm comparison: a Conventional feature distribution alignment versus b our target prediction distribution (\({\hat{P}}_{\mathrm{{T}}}\)) search for matching the ideal target distribution. This search is driven by a progressive strategy with error control: c we construct a flow of proxy distributions (\(\{P_k\}_{k=1}^{K-1}\)) with sufficiently small shifts between adjacent distributions to connect the source distribution (\({P}_{\mathrm{{S}}}\)) and the ideal, unknown target distribution (\({P}_{\mathrm{{T}}}\)). d Considering \({{P}_{\mathrm{{S}}}}\) and \({P}_{\mathrm{{T}}}\) as two different points in a metric space, our method essentially searches along the geodesic path between the two points. Critically, it can be proven that by searching along this path while minimizing the shift between adjacent proxy distributions in the flow, \({\hat{P}}_{\mathrm{{T}}}\) can closely match \({P}_{\mathrm{{T}}}\) (Theorem 1)

To overcome the aforementioned foundational limitation, in this work we introduce a novel Target Prediction Distribution Searching (TPDS) paradigm. We reformulate the SFDA problem as searching for the target prediction distribution, in contrast to conventional feature distribution alignment (Fig. 1b). The target prediction distribution is formed by the model outputs of all the unlabeled training samples from the target domain. The key challenge is how to mitigate the misleading effect caused by the unknown errors of predicted label distributions. To tackle this obstacle, we search for a proxy \({{{\hat{P}}}_{T}}\) under an approximated condition in which the source and target domains share the same distribution; that is, the adaptation error needs to be minimized. To achieve this, we introduce a progressive search strategy based on a flow of proxy distributions in which adjacent ones are only slightly shifted (Fig. 1c). As a result, a typically large distribution gap from source to target domain can be reduced gradually over multiple stages. Essentially, viewing \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) as two distinct points in a metric space, the search induced by TPDS aims to find an optimal geodesic path between them with minimal accumulated error (Fig. 1d). Critically, we prove theoretically that when the distribution shifts along this path are sufficiently small, the transfer error across the two domains can be well bounded (Theorem 1).

We further instantiate a TPDS model in deep learning. Concretely, we split the whole training process evenly into multiple stages. Each stage corresponds to a single-step search driven by aligning two adjacent proxy distributions in the flow. To that end, we design a new algorithm named Pairwise Alignment with Category Consistency (PACon). More specifically, manifold geometry-guided credible sampling discovers the potential data pairs (i.e., shift estimation), followed by mutual information maximization-based optimization for shift reduction.

The contributions of this work are summarized as follows:

(1) We propose a novel TPDS paradigm for SFDA that does not rely heavily on the accuracy of source domain auxiliary information. Critically, TPDS comes with a theoretical guarantee on adaptation error mitigation, which is largely lacking in previous feature distribution alignment-based alternatives.

(2) To mitigate the cross-domain transfer error, we develop a new PACon method to align any two adjacent distributions in the flow. Unlike popular shift measures such as MMD or KL-divergence, PACon encourages pairwise alignment with explicit geometric semantics intrinsic to adjacent distributions.

(3) We evaluate the proposed approach on five challenging domain adaptation benchmarks. Extensive experiments show that our TPDS yields new state-of-the-art results.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 details the proposed paradigm, followed by the model instantiation in Sect. 4. Section 5 presents the experimental results and analyses. Section 6 concludes the paper.

2 Related Work

2.1 Unsupervised Domain Adaptation

For UDA, the key is to reduce the domain drift. Since the source and target data are both accessible during the transfer phase, probability matching is the main idea for solving this problem. Depending on whether a deep learning algorithm is used, current work in UDA can be divided into two categories: (1) deep-learning-based and (2) non-deep-learning-based. In the first category, researchers rely on techniques such as metric learning to reduce domain drift (Long et al., 2015, 2018; Pan et al., 2019). In these methods, an embedding space with a unified probability distribution is learned by minimizing certain statistical measures, e.g., MMD (maximum mean discrepancy) (Tzeng et al., 2014), which evaluate the discrepancy between the domains. In addition, adversarial learning has been another popular framework owing to its capability of aligning the probabilities of two different distributions (Hoffman et al., 2018; Zhang et al., 2019; Munro & Damen, 2020). The second category reduces the drift in diverse manners. For example, an energy distribution-based classifier was developed to detect confident target data (Tang et al., 2019). Structural knowledge (Xia et al., 2022), category contrast (Huang et al., 2022) and spectral information (Zhang et al., 2022) have also been exploited to boost the adaptation. In all the aforementioned methods, the source data is indispensable because labeled samples are used to explicitly formulate domain knowledge (e.g., probability, structure, energy or spectral information). When the labeled data in the source domain are not available, these conventional UDA methods fail.

2.2 Source-Free Domain Adaptation

The current mainstream approach to SFDA follows the paradigm of feature distribution alignment. Existing methods can generally be divided into two classes. The first class performs explicit feature alignment by converting SFDA to the conventional UDA problem (Li et al., 2020; Du et al., 2021; Tian et al., 2022). These methods reconstruct a fake source domain under some source hypothesis, and then align the target data to the pseudo source data in the feature space. The second class conducts the alignment implicitly by adapting the source model to the target domain via self-supervised learning. In the absence of source domain data, the source model is used to generate an auxiliary factor, such as hard samples (Li et al., 2022) or prototypes (Tanwisuth et al., 2021), to assist feature alignment. Alternatively, other methods mine the auxiliary information from the target domain data. Besides widely used pseudo-labels, e.g., clustering-based pseudo-label generation (Liang et al., 2020) and pseudo-label denoising (Chen et al., 2021; Ahmed et al., 2022), geometric information such as the intrinsic neighborhood structure (Yang et al., 2021) and the data manifold (Tang et al., 2022) has also been exploited to guide model adaptation. In contrast to all the previous methods, we introduce a novel target prediction distribution search paradigm that is conceptually different from feature distribution alignment.

2.3 Gradual Domain Adaptation

In transfer learning, the work most relevant to ours is Gradual Domain Adaptation (GDA), which performs knowledge transfer along a time dimension (e.g., years). In this setting, the shift dynamics are given and represented by a series of intermediate unlabeled domains between the source domain and the target domain. At a high level, gradual self-training (GST) is the main strategy, and there are two main research lines. The first line extends the GST framework to address a variety of GDA cases, such as the scenario where some intermediate domains are missing (Abnar et al., 2021) or where the intermediate domain index is not predefined (Chen et al., 2021). The second line (Kumar et al., 2020; Wang et al., 2022) focuses on understanding GDA through theoretical analysis. Our work differs significantly from GDA in that no intermediate domains are given, rendering previous GDA methods inapplicable.

2.4 Progressive Transfer in Domain Adaptation

Existing progressive transfer methods for domain adaptation can be split into three groups. The first group is subspace-based, assuming that the source and target domains are two points on a manifold. To reduce the domain gap, subspaces along the geodesic path are interpolated to connect the two points (Gopalan et al., 2011; Caseiro et al., 2015; Cui et al., 2014). The second group is gradual learning-based (e.g., curriculum learning (Roy et al., 2021), deep clustering (Liang et al., 2021)). In an epoch-wise training fashion, these methods use the previous-epoch model to guide the current training epoch. The third group is domain generation-based. The core idea is to generate a flow of intermediate, smoothly shifting domains that bridge the gap between the source and target domains (Gong et al., 2019). Our method belongs to this group. Importantly, we highlight the key novel designs in comparison: (1) we form the intermediate domains with simple yet reliable probability distributions; (2) we uniquely incorporate error control into the progressive learning process; (3) our formulation is tailored for source-free domain adaptation, without the need for accessing the source data as required in (Gong et al., 2019).

3 Methodology

In this section, we first formulate the SFDA problem and then formalize target prediction distribution searching. Finally, we present the optimization analysis for a single-step search in the matching process.

3.1 Source Data-Free Domain Adaptation Formulation

Consider two different but related domains, i.e., the source domain \(\mathrm{{S}}\) and the target domain \(\mathrm{{T}}\). Let \({\mathcal {X}}_s=\{{\varvec{x}_{i}^s\}_{i=1}^{n_s}}\) and \({\mathcal {Y}}_s=\{{y}_{i}^s\}_{i=1}^{n_s}\) be the source samples and their corresponding labels. The target data and their labels are \({\mathcal {X}}_t=\{{\varvec{x}_{i}\}_{i=1}^{n}}\) and \({\mathcal {Y}}_t=\{{y}_{i}\}_{i=1}^{n}\), respectively, where n is the number of target samples. Both domains share the same C-way classification task. In the SFDA setting, given a source model \(\theta _s:{\mathcal {X}}_s \mapsto {\mathcal {Y}}_s\) pre-learned on \(\left( {\mathcal {X}}_s, {\mathcal {Y}}_s\right) \), we intend to learn a target model \(\theta _t: {\mathcal {X}}_t \mapsto {\mathcal {Y}}_t\) through adaptation to the target domain. During the transfer process, only the source model \(\theta _s\) and the unlabeled target data \({\mathcal {X}}_t\) are available.

3.2 Target Prediction Distribution Searching

Unlike conventional feature distribution alignment, we reformulate the SFDA problem as searching for the optimal target prediction distribution. We start with the initial prediction distribution \({P}_{\theta _s}\), obtained by applying the source model \(\theta _s\) to \({\mathcal {X}}_t\). The objective of TPDS is to identify the ideal prediction distribution \(P_{\mathrm{{T}}}\) (unknown), which typically differs significantly from \({P}_{\theta _s}\) (i.e., a large distribution shift/gap). We formulate this as a distribution optimization problem:

$$\begin{aligned} \arg \min _{\Theta } \mathrm{{D}}\left( {{\hat{P}}}_{\mathrm{{T}}}, {P_{\mathrm{{T}}}} \right) , {{\hat{P}}}_{\mathrm{{T}}} = \mathrm{{SE}}({P}_{\theta _s}) \end{aligned}$$
(1)

where \({{\hat{P}}}_{\mathrm{{T}}}\) specifies the estimated target prediction distribution, \(\mathrm{{SE}}(\cdot )\) stands for the search process starting from \({P}_{\theta _s}\), \(\mathrm{{D}}\left( \cdot ,\cdot \right) \) measures the discrepancy between two distributions, and \(\Theta \) refers to the parameters to be learned. This optimization poses two key challenges: (1) \(P_{\mathrm{{T}}}\) is unknown, making it hard to optimize, and (2) the gap between \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) is large, so a one-step search for \({{\hat{P}}}_{\mathrm{{T}}}\) can hardly achieve a good result.

To overcome both challenges, inspired by the spirit of gradual adaptation (Liang et al., 2020; Abnar et al., 2021), we design a progressive search strategy. Specifically, from the distribution \({P}_{\theta _s}\) to \(P_{\mathrm{{T}}}\), we construct a proxy prediction distribution flow \(P_{\theta _{0}} \rightarrow P_{\theta _{1}} \cdots \rightarrow P_{\theta _{k}} \rightarrow \cdots \rightarrow P_{\theta _{K}}\), where \(P_{\theta _{0}}={P}_{\theta _s}\) with \(\theta _{0}=\theta _{s}\), \(P_{\theta _{K}}={{\hat{P}}}_{\mathrm{{T}}}\) with \(\theta _{K}=\theta _{t}\), and \(\theta _{k}\) represents the k-th intermediate model estimating the proxy distribution \(P_{\theta _{k}}\). Consider a metric space induced by the measure \(\mathrm{{D}}\), in which \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) can be viewed as two distinct points. The proxy distribution flow connecting \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) then specifies the search route. Clearly, there are many possible routes, but which one leads to the best matching? Intuitively, the geodesic path is optimal because its accumulated domain shift is minimal. In fact, we can theoretically justify the choice of the geodesic path as follows.

First, we select a proper measure to quantify the distribution shift. Of note, the search for \(P_{\theta _k}\), which takes \(P_{\theta _{k-1}}\) as its starting point, has a key property: \(P_{\theta _{k}}\) inherits from \(P_{\theta _{k-1}}\), and its shift from \(P_{\theta _{k-1}}\) is small. Namely, we can regard the geometric shape of \(P_{\theta _k}\) as derived from \(P_{\theta _{k-1}}\) by a slight geometric change. In this context, we do not adopt the popular MMD but select a Wasserstein distance as the shift measure \(\mathrm{{D}}\), for two reasons. First, owing to its inherent geometric meaning, the Wasserstein distance is theoretically more suitable than other measures when the two adjacent distributions have a certain geometric relation (Mueller & Jaakkola, 2015). Second, previous work (Shen et al., 2018) has verified that the Wasserstein distance outperforms MMD in domain adaptation. Considering that SFDA is a classification-oriented problem, we use a Wasserstein-infinity distance-based measure (Kumar et al., 2020), denoted by \(D_w(\cdot ,\cdot )\). For any adjacent proxy distributions, their shift for C-way classification is measured as

$$\begin{aligned} \begin{aligned}&\mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) = \max \{ {{d_0}, \ldots ,{d_c}, \ldots {d_{C - 1}}} \},\\&{d_c} = {W_\infty }\left( {{P_{{\theta _{k - 1}}}}\left( { X_1 \mid Y_1 = c} \right) ,{P_{{\theta _k}}}\left( { X_2 \mid Y_2 = c} \right) } \right) , \end{aligned} \end{aligned}$$
(2)

where \({W_\infty }(\cdot ,\cdot )\) is the Wasserstein-infinity distance, random variables \(X_1\) and \(X_2\) stand for the samples following \({P_{{\theta _{k-1}}}}\) and \({P_{{\theta _{k}}}}\) respectively, random variables \(Y_1\) and \(Y_2\) denote the category, and the conditional distributions \({{P_{{\theta _{k - 1}}}}}\!\left( { X_1 \mid Y_1 \!= \!c} \right) \) and \({{P_{{\theta _{k}}}}}\!\left( { X_2 \mid Y_2 \!= \!c} \right) \) are the probability measures on the c-th category induced by \({{{{\theta _{k - 1}}}}}\) and \({{{{\theta _{k}}}}}\) respectively.
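To make the measure in Eq. (2) concrete, the following minimal sketch approximates the per-class shift between two sets of model predictions. Since the exact Wasserstein-infinity distance is difficult to compute, the sketch substitutes the 1-D Wasserstein-1 distance from SciPy applied to per-class confidence scores, and it assumes class assignments (e.g., pseudo-labels) are available; both are illustrative simplifications rather than the formulation actually used by the method.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D W1, a cheap stand-in for W_inf

def classwise_shift(probs_prev, probs_curr, labels):
    """Approximate D_w in Eq. (2): the maximal per-class shift between two
    prediction distributions.

    probs_prev, probs_curr: (n, C) softmax outputs of models theta_{k-1}, theta_k.
    labels: (n,) hypothetical class assignments (e.g., pseudo-labels); the paper
    avoids needing them, this sketch only illustrates the definition.
    """
    labels = np.asarray(labels)
    num_classes = probs_prev.shape[1]
    per_class = []
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() == 0:
            continue
        # Compare the class-c confidences assigned by the two models to samples
        # of class c (a 1-D surrogate of the conditional law in Eq. (2)).
        per_class.append(wasserstein_distance(probs_prev[mask, c], probs_curr[mask, c]))
    return max(per_class) if per_class else 0.0
```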

With the measure presented in Eq. (2) and the theoretical results in (Kumar et al., 2020), we derive the following theorem on the transfer performance upper bound of our progressive search. The proof is given in Appendix A.

Theorem 1

Suppose the distributions in the proxy distribution flow \(\{ {P_{\theta _{k}}} \}_{k=0}^{K}\) exhibit no label shift (the C categories are fixed) and the data is bounded (i.e., \(\Vert \varvec{x}_i\Vert _2^2 \le \rho \), \(\rho >0\) for \( 1 \le i \le n\)). Let the distribution shifts along this flow be \(\Pi = \{ \pi _k \}_{k=1}^{K}\), where \(\pi _k\) denotes the shift between \(P_{\theta _{k-1}}\) and \(P_{\theta _{k}}\) for \(k=1,\dots ,K\), and let \(\pi _m = \max (\Pi )\). If the source model \(\theta _s=\theta _0\) has low loss \(\alpha _0 \ge \alpha ^*\) on the source domain, then

$$\begin{aligned} \begin{aligned} L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \!\le \! {\left( {{2 \over {1 - \pi _m R}}} \right) ^{K + 1}} \!\! \left( {{\alpha _0} + O\!\left( {{1 \over {\sqrt{n} }}} \right) } \!\right) , \end{aligned} \end{aligned}$$
(3)

where \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \) is the objective loss when learning \({\theta _K}\) for predicting \({P_{{\theta _K}}}\), R stands for the regularization strength of \({\theta _K}\), \(\alpha ^{*}\) is a given small loss, and n is the size of the target dataset.

Since \({P_{\mathrm{{T}}}}\) in Eq. (1) is unknown, we cannot directly evaluate \(\mathrm{{D}}( {{\hat{P}}}_{\mathrm{{T}}}, {P_{\mathrm{{T}}}})\). We instead analyze the objective loss for predicting \({P_{{\theta _K}}}\) on the target domain, i.e., \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \). In the case of \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) =0\), we arrive at \({{\hat{P}}}_{\mathrm{{T}}}={P_{\mathrm{{T}}}}\).

Remark

Theorem 1 suggests that reducing the maximal distribution shift \(\pi _m\) of this flow can lower the empirical risk of the resulting distribution \({P_{{\theta _K}}}\) on the target domain. In practice, \(\pi _m\) is not determined until the search ends. To overcome this challenge, we require all distribution shifts \(\{\pi _k\}_{k=1}^{K}\) of this flow to be sufficiently small, so that the final empirical risk is approximately minimized. Under the geometric view in the metric space discussed earlier, this design means that the search should proceed along the geodesic path for the best matching between \({{\hat{P}}}_{\mathrm{{T}}}\) and \({P_{\mathrm{{T}}}}\).

Following this geometric principle, we realize the proposed learning strategy by transforming the original optimization (Eq. (1)) into the following K sub-problems:

$$\begin{aligned} \begin{aligned} \arg \min _{\theta _k} \mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) , ~ k=1,\dots ,K. \end{aligned} \end{aligned}$$
(4)

This defines a search process along the geodesic path, as shown in Fig. 1d. Specifically, the k-th sub-problem refers to a single-step search that computes the current distribution \({P_{\theta _{k}}}\), given the previous distribution \({P_{\theta _{k-1}}}\) formed by model \(\theta _{k-1}\). The entire search process of TPDS yields a proxy distribution flow with sufficiently small shifts between adjacent distributions.

3.3 Single-Step Searching

As indicated by Eq. (4), a single-step search is driven by aligning the adjacent distributions, namely minimizing the distance \(\mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) \). In practice, we do not adopt the original definition in Eq. (2) to estimate this distance. According to Eq. (2), we would need to iteratively compute the Wasserstein-infinity distances of all C categories and take the maximal one as the distribution shift. However, in our context no accurate category information is available. To solve this problem, we propose to minimize all the distances \(\{ {{d_0}, \ldots , {d_c}, \ldots {d_{C - 1}}} \}\) without requiring the category information. This leads to the following reformulation:

$$\begin{aligned} \begin{aligned}&d_k = {W_\infty }\left( {{P_{{\theta _{k - 1}}}}\left( {\left. X_1 \right| Y} \right) ,{P_{{\theta _k}}}\left( {\left. X_2 \right| Y} \right) } \right) , \\&with~Y = Y_1 = Y_2. \end{aligned} \end{aligned}$$
(5)

This category-agnostic formula allows us to compute the distribution shift in an unsupervised manner, facilitating the subsequent analysis for shift reduction.

To minimize \(d_k\), we have to address two issues: (1) what kind of data supports the \(d_k\) estimation, and (2) how to reduce \(d_k\) based on the found data. For the first issue, Eq. (5) makes the answer clear: \(d_k\) is associated with data paired by the same category. Given these paired data, for the second issue, since \(d_k\) is a Wasserstein distance built on point-to-point transport, we can perform a pairwise alignment to mimic this point-to-point process. However, this solution is not directly practical for two reasons: (1) we cannot accurately pair the data by category due to the absence of real target labels in the SFDA setting; and (2) how to encourage pairwise alignment is unclear.

To address the first problem, we select data with a neighbor relation in the feature space as the data pairs, denoted by \((\varvec{x}_i, \varvec{x}'_i) \in ({\mathcal {X}}_t \times {\mathcal {X}}_t)\) for \(1 \le i \le n\). Although the same-category constraint is not guaranteed, the transport-style alignment still works. To explain this, Fig. 2 illustrates the situation of aligning \(P_{c=\mathrm{{O}}}\) to \(Q_{c=\mathrm{{O}}}\) based on data pairs with a neighbor relation. Because their features are close, \(\varvec{x}_i\) and \(\varvec{x}'_i\) have similar but different distributions over the categories. Thus, as shown in Fig. 2a, the data pairs fall into only two kinds: those sharing the same circle category, termed group A (connected with orange dotted lines), and those with two different but related categories (their category distributions overlap), termed group B (connected with blue dotted lines). As shown in Fig. 2b, the transport over group A aligns \(P_{c=\mathrm{{O}}}\) to \(Q_{c=\mathrm{{O}}}\), whereas the transport over group B broadens the final aligned distribution from \(Q_{c=\mathrm{{O}}}\) (green oval) to \({\widetilde{Q}}_{c=\mathrm{{O}}}\) (blue oval). Clearly, this transport cannot change the structure of the multi-class distribution; it only blurs the category boundary to some extent.

Fig. 2

The alignment from distribution \(P_{c=\mathrm{{O}}}\) to \(Q_{c=\mathrm{{O}}}\) in a transport manner using data pairs with a neighbor relation. a Due to the neighbor constraint, the cross-distribution data pairs can be divided into group A, sharing the same category, and group B, with different but related categories. b The transport over these data pairs encourages \(P_{c=\mathrm{{O}}}\) to align to a wider distribution \({\widetilde{Q}}_{c=\mathrm{{O}}}\) covering the real target distribution \(Q_{c=\mathrm{{O}}}\). Clearly, this transport works: the movement cannot change the structure of the multi-class distribution but only blurs the category boundary to some extent

To address the second problem, considering that data of the same category are close to each other in feature space, we encourage the pairwise alignment by introducing a pairwise category consistency constraint on the pairs \(\{(\varvec{x}_i, \varvec{x}'_i)\}_{i=1}^{n}\). Here, this consistency is confined to a single data pair, and different pairs may belong to different categories.

3.4 Overview of Training

In our TPDS paradigm, the proxy distribution flow is expected to converge to the ideal target distribution \(P_{\mathrm{{T}}}\) progressively. To this end, the adaptation training process is sliced into K successive stages \(\{E_k\}_{k=1}^K\). In stage \(E_k\), we perform a single-step search for \(P_{\theta _k}\) w.r.t. \(P_{\theta _{k-1}}\) by training the model from \(\theta _{k-1}\) to \(\theta _k\).
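As a schematic illustration, the sketch below outlines the K-stage loop described above; `single_step_search` is a callable placeholder for one PACon stage (Sect. 4), not a function defined in this paper.

```python
import copy

def tpds_adapt(source_model, target_loader, num_stages, single_step_search):
    """Schematic K-stage adaptation: each stage aligns P_{theta_k} to P_{theta_{k-1}}.

    `single_step_search` is a placeholder for one PACon stage: it takes
    (current_model, previous_model, target_loader) and returns the updated model.
    """
    theta_prev = copy.deepcopy(source_model)      # theta_0 = theta_s (anchor of stage E_1)
    theta_curr = copy.deepcopy(source_model)      # model updated in the current stage
    for _ in range(num_stages):                   # stages E_1 ... E_K
        theta_curr = single_step_search(theta_curr, theta_prev, target_loader)
        theta_prev = copy.deepcopy(theta_curr)    # P_{theta_k} becomes the next anchor
    return theta_curr                             # theta_K, predicting \hat{P}_T
```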

4 Model Instantiation

As a showcase of our paradigm, we implement a TPDS instantiation in deep learning. Without loss of generality, we detail the search process for \(P_{\theta _k}\) in stage \(E_k\) as an example. Specifically, at the beginning of this stage, the model is initialized with \(\theta _{k-1}\), and a proxy distribution on the target domain is constructed. Next, we search for the optimal \(\theta _k\) in an unsupervised manner, achieved by a Pairwise Alignment with Category Consistency (PACon) algorithm.

4.1 Model Structure of \(\theta _k\)

During the transfer process, all models, including the source model \(\theta _s\) and all intermediate models \(\{\theta _k\}_{k=1}^{K}\) predicting the proxy distributions, share the same composition. Specifically, \(\theta _k\) consists of a feature extractor \(\phi _k\) and a classifier \(\upsilon _k\) ending with a softmax operation, thus \(\theta _k=\upsilon _k \circ \phi _k\) and \(\theta _{k-1}=\upsilon _{k-1} \circ \phi _{k-1}\), where \(\circ \) denotes function composition. In the concrete implementation, we use two neural networks for the two modules: (1) a deep architecture serves as the feature extractor, and (2) a four-layer network is used as the classifier. More details are given in the Implementation Details of the Experiments section.
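A minimal PyTorch sketch of the composition \(\theta _k=\upsilon _k \circ \phi _k\) is given below, loosely mirroring the implementation details reported in Sect. 5.2.1 (ResNet backbone, 2048\(\rightarrow \)256 bottleneck with batch normalization, weight-normalized output layer); the exact layer configuration here is an illustrative assumption and simplifies the classifier.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm
from torchvision import models

class FeatureExtractor(nn.Module):
    """phi_k: ResNet backbone + bottleneck (2048 -> 256) with batch normalization."""
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()              # keep the 2048-d pooled features
        self.backbone = backbone
        self.bottleneck = nn.Linear(2048, feat_dim)
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, x):
        return self.bn(self.bottleneck(self.backbone(x)))

class Classifier(nn.Module):
    """upsilon_k: simplified to a weight-normalized linear layer plus softmax."""
    def __init__(self, feat_dim=256, num_classes=65):
        super().__init__()
        self.fc = weight_norm(nn.Linear(feat_dim, num_classes))

    def forward(self, z):
        return self.fc(z).softmax(dim=-1)

# theta_k as the composition upsilon_k o phi_k
phi_k, upsilon_k = FeatureExtractor(), Classifier()
theta_k = nn.Sequential(phi_k, upsilon_k)
```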

Fig. 3

Overview of pairwise alignment from the distribution \({P_{{\theta _{k}}}}\) to the distribution \({P_{{\theta _{k - 1}}}}\). Two steps are involved. 1 Shift estimation: generating sample pairs across the two distributions by a chain-like search on the data manifold. 2 Shift reduction: pulling together the found sample pairs by maximizing the mutual information in the prediction space

4.2 Overview of PACon

Corresponding to the insight from the previous section, our PACon algorithm has two successive components, as shown in Fig. 3: (I) distribution shift estimation, and (II) distribution shift reduction. The first component is based on a credible sampling method for generating data pairs. Specifically, at the beginning of each stage (epoch) \(E_k\), all target samples are first embedded by the previous-stage model \(\theta _{k-1}\) to form a feature space for search. During \(E_k\), given an input batch, we extract its features using the up-to-date model \(\theta _{k}\) and identify the paired samples in the search space by the chain-like search process (Fig. 4). The second component then aligns \(P_{\theta _k}\) to \(P_{\theta _{k-1}}\) by maximizing the mutual information over those data pairs.

4.3 Distribution Shift Estimation

As analyzed above, the distribution shift estimation depends on data pairs with a neighbor relation. To account for the fact that deep features lie on a data manifold, our credible sampling for data pair detection builds the neighbor relation on the feature manifold via a chain-like search. Furthermore, since the categories obtained in an unsupervised way are noisy, the sampled data forming the pairs are required to be credible, termed credible neighbors.

Suppose the data pairs constructed by the credible sampling are \(\{(\varvec{x}_i, \varvec{x}'_i)\}_{i=1}^{n} \subset ({\mathcal {X}}_t \times {\mathcal {X}}_t)\), where \(\varvec{x}_i\) (from \({P_{{\theta _{k}}}}\)) and \(\varvec{x}'_i\) (from \({P_{{\theta _{k-1}}}}\)) are the input instance and its credible neighbor, respectively. Using models \(\theta _{k}\) and \(\theta _{k-1}\) respectively, \(\varvec{x}_i\) and all target data \({\mathcal {X}}_t\) are mapped into the feature space. Concretely, the feature extractor \(\phi _{k}\) (in \(\theta _{k}\)) transforms \(\varvec{x}_i\) into the feature \(\varvec{z}_i^{\star }\), whilst the feature extractor \(\phi _{k-1}\) (in \(\theta _{k-1}\)) maps \({\mathcal {X}}_t\) into features \({\mathcal {Z}}=\{\varvec{z}_i\}_{i=1}^{n}\), where \(\varvec{z}_i=\phi _{k-1}(\varvec{x}_i)\), forming a data manifold. Then the classifier \(\upsilon _{k-1}\) converts \({\mathcal {Z}}\) into probability vectors \({\mathcal {P}}=\{\varvec{p}_i \}_{i=1}^{n}\), where \(\varvec{p}_i=\upsilon _{k-1}(\varvec{z}_i)\). The data pair construction proceeds in the following two steps.

Step A: Credible group construction. First, we generate a group \({\mathcal {G}}_e\) using popular entropy-based ranking over \({\mathcal {Z}}\), as in (Liu et al., 2021; Yang et al., 2020). With the entropy computation, \({\mathcal {P}}\) is converted to the entropy set \({\mathcal {H}}=\{{h}_i \}_{i=1}^{n}\), where \({h}_i=-\sum {{\varvec{p}}}_{i}\log {{\varvec{p}}}_{i}\). Thus, \({\mathcal {G}}_{e}\) can be obtained by

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_e= \left\{ \varvec{z}_i \mid {\varvec{z}_i \in {{\mathcal {Z}}}, \hspace{2.0pt}i \in \mathrm{{topk}}\left( {{{\mathcal {H}}}}, \sigma _e n \right) } \right\} , \end{aligned} \end{aligned}$$
(6)

where \(\sigma _e\) is a scaling factor and \(\mathrm{{topk}}\left( \cdot ,\cdot \right) \) is defined below Eq. (8).
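A minimal sketch of the entropy-based selection in Eq. (6), assuming `probs` holds the softmax predictions \({\mathcal {P}}\) produced by \(\theta _{k-1}\) on all target data:

```python
import torch

def entropy_group(probs, sigma_e=0.5):
    """Eq. (6): select the sigma_e * n samples with the lowest prediction entropy."""
    eps = 1e-8
    entropy = -(probs * (probs + eps).log()).sum(dim=1)   # h_i = -sum_c p_ic log p_ic
    k = int(sigma_e * probs.shape[0])
    # largest=False returns the k lowest-entropy (most confident) samples
    return torch.topk(entropy, k, largest=False).indices
```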

We consider this entropy-based strategy (i.e., \({\mathcal {G}}_e\)) limited in the sense that there exists a many-to-one projection between prediction distributions and entropy values, leading to ambiguous selection. To mitigate this problem, we introduce another selection criterion based on the class-aware feature geometric structure, with a particular stress on the most likely class prediction. Relying on the most likely class probability alone would add little to the entropy measurement, since the two are highly correlated and thus redundant: the higher the probability the most likely class receives, the lower the entropy of the prediction distribution.

Specifically, to further enhance the credibility, we split off another group, \({\mathcal {G}}_{o}\), by clustering-based ranking. We obtain C cluster centers by the weighted k-means formulation in Eq. (7), where \(p_{i,c}\) is the c-th element of the vector \(\varvec{p}_i\).

$$\begin{aligned} \varvec{o}_c = \frac{\sum _{i=1}^{n} p_{i,c} {\hspace{2.0pt}} {\varvec{z}}_{i}}{\sum _{i=1}^{n} p_{i,c}}, {\hspace{2.0pt}} 0 \le c \le C-1. \end{aligned}$$
(7)

Thus, the data credibility can be expressed by the minimum distance of a sample from the C cluster centers. Suppose the distances of \(\varvec{z}_i\) from \(\{\varvec{o}_c\}_{c=1}^{C}\) form the vector \(\varvec{b}_i \in {\mathbb {R}}^{C}\). The c-th element of \(\varvec{b}_i\), standing for the distance from the c-th cluster center, equals \(D_{cos}\left( \varvec{z}_i, \varvec{o}_c\right) \), where \(D_{cos}\) denotes the cosine distance. Let \(a_i=\mathrm{{min}}(\varvec{b}_i)\) be \(\varvec{x}_i\)'s minimum distance from the C centers, so that we obtain a measure set \({\mathcal {A}}=\{{a}_i \}_{i=1}^{n}\) over the target data. Thus, \({\mathcal {G}}_o\) is obtained by

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_o = \left\{ \varvec{z}_i \mid {\varvec{z}_i \in {{\mathcal {Z}}}, i \in \mathrm{{topk}}\left( {{{\mathcal {A}}}}, \sigma _o n \right) } \right\} , \end{aligned} \end{aligned}$$
(8)

where \(\sigma _o\) is a scaling factor, and \(\mathrm{{topk}}\left( {\mathcal {X}}, k\right) \) returns the indexes of the k smallest elements of the set \({\mathcal {X}}\).

Finally, we get the credible data group \({\mathcal {G}}\) by an intersection operation as \({\mathcal {G}} = {\mathcal {G}}_o \cap {\mathcal {G}}_e\). In Fig. 4, these feature data belonging to \({\mathcal {G}}\) are marked in blue.
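A corresponding sketch of Eqs. (7)-(8) and of the intersection \({\mathcal {G}} = {\mathcal {G}}_o \cap {\mathcal {G}}_e\) is given below; it reuses the `entropy_group` helper from the earlier sketch, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clustering_group(feats, probs, sigma_o=0.6):
    """Eqs. (7)-(8): prediction-weighted class centers, then keep samples close to a center.

    feats: (n, d) features Z from phi_{k-1}; probs: (n, C) predictions P from upsilon_{k-1}.
    """
    # Eq. (7): o_c = sum_i p_ic z_i / sum_i p_ic
    centers = (probs.t() @ feats) / probs.sum(dim=0, keepdim=True).t().clamp(min=1e-8)
    # Cosine distance of every feature to every center: b_i in R^C
    dist = 1 - F.normalize(feats, dim=1) @ F.normalize(centers, dim=1).t()   # (n, C)
    a = dist.min(dim=1).values                                               # a_i = min(b_i)
    k = int(sigma_o * feats.shape[0])
    return torch.topk(a, k, largest=False).indices                           # Eq. (8)

def credible_group(feats, probs, sigma_o=0.6, sigma_e=0.5):
    """G = G_o intersect G_e, combining the two index sets."""
    g_o = set(clustering_group(feats, probs, sigma_o).tolist())
    g_e = set(entropy_group(probs, sigma_e).tolist())
    return sorted(g_o & g_e)
```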

Fig. 4

Illustration of credible sampling. The feature vectors of all the target domain data extracted by model \(\theta _{k-1}\) form a search space. Given the feature vector of a target domain sample extracted by model \(\theta _{k}\) (red triangle), the objective is to identify the best match according to the manifold geometry of the search space

Step B: Data pair generation. To construct the data pairs, we find the credible neighbor \(\varvec{x}'_i\) in \({\mathcal {G}}\) by a chain-like search, as illustrated in Fig. 4. Starting with the feature of the input instance, \(\varvec{z}_i^{\star }\) (the red triangle), we carry out a one-step search to find its nearest neighbor, denoted by \(\hat{\varvec{z}}_1\), based on the cosine distance. If \(\hat{\varvec{z}}_1\) does not belong to \({\mathcal {G}}\), we conduct another one-step search for the new nearest neighbor \(\hat{\varvec{z}}_2\), taking \(\hat{\varvec{z}}_1\) as the starting point. Repeating this process, we build a search flow \(\{\varvec{z}_i^{\star }, \hat{\varvec{z}}_1, \hat{\varvec{z}}_2, \dots , {\varvec{z}'}_i\}\) that reaches \({\mathcal {G}}\). In this flow, the end element \({\varvec{z}'}_i = \phi _{k-1}(\varvec{x}'_i)\) (the yellow triangle) is the feature of the credible neighbor we seek, whilst the other elements (marked in light blue), except \(\varvec{z}_i^{\star }\), are intermediate features. Note that to avoid cycles along the flow, our search excludes its history. For example, when searching for \(\hat{\varvec{z}}_k\), the historical elements \(\varvec{z}_i^{\star }\) and \(\{\hat{\varvec{z}}_j\}_{j=1}^{k-1}\) are excluded from the similarity comparison.
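The chain-like search can be sketched as follows; the function name and the `max_steps` safeguard are illustrative assumptions rather than parts of the original algorithm.

```python
import torch
import torch.nn.functional as F

def chain_search(z_star, search_feats, credible_idx, max_steps=50):
    """Find a credible neighbor z'_i for the query feature z_star (Fig. 4).

    z_star: (d,) feature of the input instance from phi_k.
    search_feats: (n, d) features of all target data from phi_{k-1}.
    credible_idx: indexes of the credible group G.
    """
    credible = set(int(i) for i in credible_idx)
    visited = set()                                   # history excluded from later hops
    query = F.normalize(z_star, dim=0)
    feats = F.normalize(search_feats, dim=1)
    for _ in range(max_steps):
        sims = feats @ query                          # cosine similarity to the current point
        if visited:
            sims[list(visited)] = -float('inf')       # exclude already-visited elements
        nxt = int(sims.argmax())
        if nxt in credible:                           # reached G: this is the credible neighbor
            return nxt
        visited.add(nxt)
        query = feats[nxt]                            # continue the chain from the new point
    return nxt                                        # fall back if G is not reached
```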

Algorithm 1

Overall training of TPDS

4.4 Distribution Shift Reduction

In our instantiation, the distribution shift is reduced by pairwise alignment of the detected data pairs under the category consistency constraint. Inspired by the theoretical connection between mutual information and pairwise losses (Boudiaf et al., 2020), we use the following objective to reach this goal, where \(I(\cdot ,\cdot )\) is the mutual information function (Paninski, 2003), computed in the same way as in (Ji et al., 2019).

$$\begin{aligned} \begin{aligned} {L}_{\mathrm{{W}}}(\theta _k) = \min _{\theta _k} \left[ -I \left( \theta _k(\varvec{x}_i), \theta _{k-1}(\varvec{x}'_i) \right) \right] . \end{aligned} \end{aligned}$$
(9)

Of note, \({L}_{\mathrm{{W}}}\) is sensitive to the target dataset scale. When the scale is small, the limited data cannot describe the probability distribution well, so the optimization effect based on the single regularizer \({L}_{\mathrm{{W}}}\) is restricted. To overcome this problem, we introduce a diversity loss that encourages balanced category predictions. This technique is widely adopted by unsupervised approaches to multi-way classification (Jabi et al., 2019) to avoid solution collapse (Ghasedi Dizaji et al., 2017), in which the model predicts all data as a few specific categories. Supposing that \(\theta _k\) transforms \(\varvec{x}_i\) into a probability vector \(\varvec{q}_i\), this loss is expressed as

$$\begin{aligned} {L}_{\mathrm{{B}}}(\theta _k) = \min _{\theta _k}\sum _{c=1}^C\mathrm{{KL}} \left( \left. {\bar{q}_{c}} \right\| {\varrho }_{c}\right) , \end{aligned}$$
(10)

where \(\mathrm{{KL}}(\cdot \Vert \cdot )\) denotes the \(\mathrm{{KL}}\)-divergence; \({\varrho }_{\{c=1, \cdots , C\}} = \frac{1}{C}\) is the uniform distribution; \(\bar{q}_c = \frac{1}{n} \sum _{i=1}^{n}{q}_{i,c}\) is the empirical label distribution, in which \({q}_{i,c}\) is the probability of \(\varvec{x}_i\) belonging to the c-th category. Combining with Eq. (9), we have the final objective:

$$\begin{aligned} \begin{aligned} L_{\mathrm{{TPDS}}}(\theta _k) = {L}_{\mathrm{{W}}}(\theta _k) + \beta _{n}{L}_{\mathrm{{B}}}(\theta _k), \end{aligned} \end{aligned}$$
(11)

where \(\beta _n\) trades off the two regularizers; its value is related to the dataset scale (represented by the target data number n): the smaller the dataset, the larger the \(\beta _n\) value (its rationality is verified in the Experiments). For clarity, we also summarize the overall training of TPDS in Alg. 1.
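For illustration, the sketch below assembles the objective in Eq. (11) from batched paired predictions. The mutual information term follows the joint-matrix estimator of Ji et al. (2019); the exact implementation used in the paper may differ in details.

```python
import torch

def mutual_info_loss(q_curr, q_prev):
    """Eq. (9): negative mutual information between paired predictions
    theta_k(x_i) and theta_{k-1}(x'_i), estimated from their (C, C) joint matrix."""
    eps = 1e-8
    joint = (q_curr.unsqueeze(2) * q_prev.unsqueeze(1)).mean(dim=0)   # (C, C)
    joint = ((joint + joint.t()) / 2).clamp(min=eps)                  # symmetrize
    joint = joint / joint.sum()
    p_i = joint.sum(dim=1, keepdim=True)
    p_j = joint.sum(dim=0, keepdim=True)
    mi = (joint * (joint.log() - p_i.log() - p_j.log())).sum()
    return -mi                                                        # minimize -I

def diversity_loss(q_curr):
    """Eq. (10): KL divergence between the empirical label distribution and uniform."""
    eps = 1e-8
    q_bar = q_curr.mean(dim=0).clamp(min=eps)                         # \bar{q}_c
    uniform = torch.full_like(q_bar, 1.0 / q_bar.numel())
    return (q_bar * (q_bar.log() - uniform.log())).sum()

def tpds_loss(q_curr, q_prev, beta_n=0.5):
    """Eq. (11): L_TPDS = L_W + beta_n * L_B."""
    return mutual_info_loss(q_curr, q_prev) + beta_n * diversity_loss(q_curr)
```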

5 Experiments and Analyses

5.1 Data Sets

In this paper, we evaluate our method on five widely used benchmarks as follows.

Digits (Hoffman et al., 2018). As a typical dataset for UDA problems, we use the three most frequently used subsets of Digits, i.e., SVHN (S), MNIST (M), and USPS (U). They contain images of digits from 0 to 9 captured in different environments. We evaluate on three relatively challenging cross-domain tasks on this dataset, i.e., S\(\rightarrow \)M, U\(\rightarrow \)M, and M\(\rightarrow \)U.

Office-31 (Saenko et al., 2010). Office-31 is a small-scale dataset widely used in domain adaptation, comprising three domains, i.e., Amazon (A), Webcam (W), and Dslr (D), all containing images of real-world objects in various office environments. The dataset has 4,652 images of 31 categories in total. Images in A are online e-commerce pictures, while W and D consist of low-resolution and high-resolution pictures, respectively.

Office-Home (Venkateswara et al., 2017). Office-Home is a medium-scale dataset mainly used for domain adaptation, containing about 15k images belonging to 65 categories from working or home environments. The dataset has four distinct domains, i.e., Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-world images (Rw).

VisDA (Peng et al., 2017). VisDA is a challenging large-scale dataset for synthetic-to-real transfer recognition over 12 categories. The source domain contains 152k synthetic images, while the target domain has 55k real object images from Microsoft COCO.

PACS (Li et al., 2017). PACS is an image dataset for domain generalization. It consists of four subdomains with 9.9k images sharing seven categories. The domains are Photo (P), Art Painting (A), Cartoon (C) and Sketch (S).

5.2 Implementation Details

5.2.1 Neural Network Architecture

We design and implement our network architecture in PyTorch. The above datasets can be divided into two types: digit recognition and object recognition. For the digit recognition task, we use a LeNet variant as the feature extraction module, as done in Liang et al. (2020). For the object recognition task, following standard practice for fair comparison, each model \(\theta _k\) comprises both the feature extractor \(\phi _k\) and the classifier \(\upsilon _k\). The feature extractor \(\phi _k\) contains a heavyweight deep architecture, a batch-normalization layer and a fully-connected layer of size 2048\(\times \)256. As done in Liang et al. (2020), Yang et al. (2021), Tang et al. (2022), for the deep architecture we adopt ImageNet-pretrained ResNet50 (He et al., 2016) on Office-31, Office-Home and PACS, and ResNet101 (He et al., 2016) on VisDA. For all datasets, the classifier \(\upsilon _k\) takes the same structure as used in (Liang et al., 2020; Yang et al., 2020, 2021; Tang et al., 2022): the input layer is a fully-connected layer with batch normalization, and the output layer is a fully-connected layer with weight normalization.

5.2.2 Source Model

\(\theta _s\) training. For all evaluation datasets, \(\theta _s\) was pretrained following the standard protocol (Liang et al., 2020; Tang et al., 2021; Yang et al., 2021). The adopted training objective is given in Appendix B. We split the labelled source data into two parts of 90%:10% for model pretraining and validation. We set the training epochs on Digits, Office-31, Office-Home, PACS and VisDA to 30, 100, 50, 50 and 10, respectively.

Table 1 Classification accuracies (%) on the Digit dataset
Table 2 Classification accuracies (%) on the Office-31 dataset based on ResNet50 backbone
Table 3 Classification accuracies (%) on the Office-Home dataset based on ResNet50 backbone
Table 4 Classification accuracies (%) on the VisDA dataset based on ResNet101 backbone

5.2.3 TPDS Training

We adopt the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.001. The learning rate is set to 0.01 for Office-31, Office-Home and PACS, and 0.001 for VisDA. We train for 15 epochs with a batch size of 64 on each target domain. TPDS has three hyperparameters: we set the scaling factors \((\sigma _o, \sigma _e)=(0.6,0.5)\) for the credible group construction on all target domains, whilst \(\beta _n=1/0.5/0\) for the small dataset (Office-31), the medium datasets (Office-Home and PACS) and the large dataset (VisDA), respectively, according to the dataset scale.
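A minimal sketch of the optimizer configuration described above (the model argument is assumed to be \(\theta _k\) from Sect. 4.1):

```python
import torch

def make_optimizer(model, lr=0.01):
    """SGD setup as reported above; lr=0.01 for Office-31/Office-Home/PACS, 0.001 for VisDA."""
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-3)
```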

5.3 Competitors

To verify the effectiveness of our method, we select 35 comparison methods, which can be divided into the following two groups according to whether the source data is accessed during the transfer phase.

(1) 19 state-of-the-art vanilla domain adaptation methods, all requiring source and target data simultaneously to handle the domain shift. They are ADDA (Tzeng et al., 2017), ADR (Saito et al., 2018), CDAN (Long et al., 2018), CyCADA (Hoffman et al., 2018), SWD (Lee et al., 2019), CAT (Deng et al., 2019), BSP (Chen et al., 2019), TN (Wang et al., 2019), SAFN (Xu et al., 2019), IA (Jiang et al., 2020), DMRL (Wu et al., 2020), STAR (Lu et al., 2020), MCC (Jin et al., 2020), CGDM (Du et al., 2021), TCM (Yue et al., 2021), SRDC (Tang et al., 2020), SUDA (Zhang et al., 2022), CaCo (Huang et al., 2022) and MSGD (Xia et al., 2022).

    Table 5 Classification accuracies (%) on the PACS dataset based on ResNet50 backbone
(2) 15 current state-of-the-art SFDA models: SFDA (Kim et al., 2021), 3C-GAN (Li et al., 2020), SHOT (Liang et al., 2020), BAIT (Yang et al., 2020), HMI (Lao et al., 2021), PCT (Tanwisuth et al., 2021), CPGA (Qiu et al., 2021), AAA (Li et al., 2022), PS (Du et al., 2021), GKD (Tang et al., 2021), A2Net (Xia et al., 2021), NRC (Yang et al., 2021), VDM (Tian et al., 2022), NEL (Ahmed et al., 2022) and U-SFAN+ (Roy et al., 2022). Among them, 3C-GAN, PS, AAA, A2Net and VDM are based on pseudo source domain generation or construction, whilst the remaining methods are based on the framework of self-supervised learning.

Fig. 5

t-SNE feature visualization on transfer task Cl \(\rightarrow \) Ar in Office-Home. Top: the first and last sub-figures present the feature distributions embedded by the source model \(\theta _s\) (before adaptation) and the adapted model \(\theta _t\) (after adaptation), respectively; from left to right, the middle three sub-figures sample the feature distribution evolution at epochs 2, 5 and 10. Bottom: the corresponding aggregation details with category information; for clarity, we select the first 30 of the 65 categories in total, where different colors stand for different classes

Fig. 6

Left: accuracy variation on task Cl \(\rightarrow \) Ar in the Office-Home dataset as \(\sigma _o\) and \(\sigma _e\) vary. Middle: accuracy variation curves on the three evaluation datasets as \(\beta _n\) varies. For a clear view, all results on each dataset are normalized by the best accuracy on that dataset. Right: accuracy and loss variation curves during training on task Cl \(\rightarrow \) Ar in the Office-Home dataset

5.4 Comparative Results

5.4.1 Digit Recognition

As reported in Table 1, TPDS obtains the best results on all tasks compared with SHOT and has a 0.3% increase in average accuracy. Compared with the UDA methods, TPDS achieves the highest performance on 2 out of 3 tasks (all except U\(\rightarrow \)M), surpassing the best UDA method SWD by 0.4% in average accuracy.

5.4.2 Object Recognition

Tables 2, 3, 4 and 5 present the quantitative results on the four datasets. On Office-31 (Table 2), TPDS obtains the best results among the SFDA methods on two transfer tasks, A\(\rightarrow \)D and A\(\rightarrow \)W, leading to a 90.2% average accuracy, a 0.2% increase over the second-best method A2Net (90.0%). On Office-Home (Table 3), TPDS outperforms the other methods on 5 out of 12 tasks. Compared with the previous best method A2Net (72.8%), TPDS improves the average accuracy by 0.7%, reaching 73.5%. On VisDA (Table 4), TPDS achieves the best performance on all categories except person and truck. In average accuracy, TPDS obtains 87.6% and surpasses the previous best method NRC by 1.7%. On PACS (Table 5), TPDS obtains the best results on all transfer tasks except A\(\rightarrow \)P and C\(\rightarrow \)S, where the gap is small. In particular, TPDS improves by 20.8% and 15.1% on tasks A\(\rightarrow \)S and P\(\rightarrow \)S, respectively, compared with the second-best method GKD. As a result, TPDS outperforms SHOT by a margin of 3.7% in average accuracy.

Besides, compared with the conventional UDA methods that require access to the source data, TPDS is also competitive on the three object recognition datasets, as shown in Tables 2, 3 and 4, despite not benefiting from source data during the adaptation phase. Specifically, on the Office-31 dataset, TPDS has a gap of 0.6% compared with the best UDA method MSGD. However, as the amount of target data increases, the advantage of TPDS grows, surpassing the best method MSGD by 1.1% and 3.0% on the Office-Home and VisDA datasets, respectively. To sum up, the comparison results on the five datasets confirm the state-of-the-art performance of TPDS.

5.5 Further Analyses

5.5.1 Feature Visualization

Using the widely used visualization tool t-SNE (Van der Maaten & Hinton, 2008), we conduct a feature visualization experiment based on the 65-way classification results of transfer task Cl\(\rightarrow \)Ar in the Office-Home dataset. Figure 5 presents the visualization results. As shown in the top row, before adaptation, the intertwined features embedded by the source model \(\theta _s\) are distributed without apparent aggregation (the first sub-figure); after adaptation, the features aggregate evidently (the last sub-figure). From left to right, the middle three sub-figures show that the features gradually cluster during the adaptation phase. For clearer observation, we select the first 30 of the 65 categories in total to present the clustering details. As shown in the bottom row, where different colors stand for different categories, the aggregation carries category meaning. The visualization results show that TPDS can predict a probability distribution with category meaning.
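A visualization of this kind can be reproduced with scikit-learn's t-SNE; the sketch below assumes `feats` and `labels` hold the extracted target features and their class ids (random placeholders are used here only so the snippet runs), and restricts the plot to the first 30 classes as in Fig. 5 (bottom).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: in practice `feats` are the 256-d target features from phi and
# `labels` the class ids predicted or annotated for visualization.
feats = np.random.randn(500, 256)
labels = np.random.randint(0, 65, size=500)

mask = labels < 30                      # first 30 of the 65 classes
emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(feats[mask])
plt.scatter(emb[:, 0], emb[:, 1], c=labels[mask], s=3, cmap='tab20')
plt.axis('off')
plt.show()
```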

5.5.2 Hyperparameter Sensitivity

To validate the sensitivity to \(\sigma _o\) and \(\sigma _e\), we conduct 30 experiments with \(\sigma _o \in [0.3, 0.8]\) and \(\sigma _e \in [0.3, 0.7]\) on task Cl\(\rightarrow \)Ar in the Office-Home dataset. As shown on the left of Fig. 6, the accuracy does not change drastically; thus, TPDS's performance is robust to \(\sigma _o\) and \(\sigma _e\). As for the sensitivity to \(\beta _n\), we conduct 6 experiments with \(\beta _n\) varying from 0 to 1.0 on each testing dataset. As shown in the middle of Fig. 6, as the dataset size increases (Office-31 \(\rightarrow \) Office-Home \(\rightarrow \) VisDA), the best result occurs at \(\beta _n=1.0/0.6/0.0\), respectively. This phenomenon is consistent with our expectation. As mentioned above, when the amount of data is not enough to describe the distribution, \(L_{\mathrm{{B}}}\) can boost the \(L_{\mathrm{{W}}}\)-only-based transfer; conversely, \(L_{\mathrm{{B}}}\) will deteriorate the performance. These results confirm the rationality of setting \(\beta _n\) in relation to the dataset size.

5.5.3 Training Stability

On the right of Fig. 6, we present the training stability of TPDS on task Cl\(\rightarrow \)Ar in the Office-Home dataset. As training proceeds from epoch 1 to 15, the accuracy climbs rapidly in the early epochs (epochs 1-4) and converges to the maximum through a slow increase with small oscillations (epochs 4-13). The loss value of \(L_{\mathrm{{TPDS}}}\) gradually decreases over all epochs, with a trend consistent with the performance change. This indicates that the training of TPDS is stable and reliable.

5.5.4 Extensibility

In the spirit of SHOT++ (Liang et al., 2021), which leverages additional training components (e.g., MixMatch for data augmentation (Berthelot et al., 2019)) on top of SHOT, we carry out a test of model extensibility in which our TPDS is further enhanced by MixMatch, termed TPDS+MixMatch. As reported in Table 6, TPDS+MixMatch obtains better results than the original version, which confirms the extensibility of our TPDS.

5.5.5 Computational Cost

We compare our method with SHOT in terms of the average training time. The results in Table 7 show that our model training is somewhat slower due to its gradual adaptation nature, which is a reasonable cost for better adaptation performance.

5.5.6 Convergence of Chain-like Search

We evaluate the convergence of our chain-like search. We first track the average number of search steps during the training phase. As shown in Table 8, our model takes fewer than two steps on average to reach a sample in the credible group. The statistics reported in Fig. 7 further indicate that our search process converges well across the training epochs.

5.6 Distribution Shift Analysis

TPDS is a probability distribution alignment-based scheme. This sub-section gives a distribution shift analysis using the MMD distance (ZongxianLee, 2019) to verify whether TPDS reduces the matching error via the progressive alignment. In this experiment, we feed the target data \({\mathcal {X}}_t\) through the source model \(\theta _s\) and take the outputs as an empirical estimate of the source distribution, i.e., \({P}_{\mathrm{{S}}}=\theta _s({\mathcal {X}}_t)\). Besides, we train an ideal target model \(\tilde{\theta }_t\) on \({\mathcal {X}}_t\) with labels, in the same way as the source model training (Appendix B), and use its outputs to represent the ideal target distribution, i.e., \({P}_{\mathrm{{T}}}=\tilde{\theta }_t({\mathcal {X}}_t)\). For comparison, we select SHOT as the baseline owing to its same epoch-wise learning strategy but without error mitigation. The predicted target distributions obtained by TPDS and SHOT are represented as \({\hat{P}}_{\mathrm{{T:tpds}}}=\theta _{tpds}({\mathcal {X}}_{t})\) and \({\hat{P}}_{\mathrm{{T:shot}}}=\theta _{shot}({\mathcal {X}}_{t})\), respectively.
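The MMD distances used in this analysis can be estimated with a standard RBF-kernel estimator between two sets of prediction vectors, as sketched below; this generic estimator stands in for the implementation of (ZongxianLee, 2019), and the bandwidth is an assumed value.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased RBF-kernel MMD^2 between two sample sets x (n, d) and y (m, d),
    e.g., prediction vectors of two models on the target data."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```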

The first two sub-figures in Fig. 8 display the MMD distance variations to \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) during training, respectively. Both \({\hat{P}}_{\mathrm{{T:tpds}}}\) and \({\hat{P}}_{\mathrm{{T:shot}}}\) move away from \({P}_{\mathrm{{S}}}\), but the distance of \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{S}}}\) always remains smaller than that of \({\hat{P}}_{\mathrm{{T:shot}}}\) (left). This indicates that our progressive alignment is effective in limiting the distance to \({P}_{\mathrm{{S}}}\), so the accurate guidance provided by \({P}_{\mathrm{{S}}}\) is better preserved. Correspondingly, \({\hat{P}}_{\mathrm{{T:tpds}}}\) and \({\hat{P}}_{\mathrm{{T:shot}}}\) approach \({P}_{\mathrm{{T}}}\) with an interesting observation (middle): SHOT's distance decreases rapidly in the early epochs (1 to 5), followed by a gradual increase after epoch 6. In contrast, TPDS's distance shows a declining trend through all epochs. This phenomenon is understandable: in the late stages, the errors in pseudo-labels propagate further in the adaptation regulated by SHOT, whereas TPDS can control this propagation thanks to the introduced adaptation error mitigation mechanism.

Table 6 Effect of MixMatch data augmentation
Table 7 Training efficiency evaluation
Table 8 Evaluating the chain-like search steps needed during training
Fig. 7

Chain-like search steps on task Ar\(\rightarrow \)Cl in Office-Home. Top: the average number of steps per epoch, where the red line stands for the average over all epochs. Bottom: the maximal number of steps in each epoch

According to Theorem 1, a progressive alignment along the proxy distribution flow encourages minimization of the matching error to the ideal target distribution \({P}_{\mathrm{{T}}}\). To verify this conclusion, we perform a further distribution shift analysis on all 12 transfer tasks in the Office-Home dataset. Unlike the analysis above, we do not use the intermediate models but instead the final model after 15 training epochs. Besides, the horizontal and vertical coordinates are changed to the distance to \({P}_{\mathrm{{S}}}\) and the distance to \({P}_{\mathrm{{T}}}\), respectively, to uncover the relation between them. To account for the different cross-domain shifts of these tasks, we normalize the MMD distances for a clearer view. Specifically, the distances of \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) are normalized by the corresponding distances of \({\hat{P}}_{\mathrm{{T:shot}}}\), respectively. As shown on the right of Fig. 8, all task points lie in the \([0,1] \times [0,1]\) area and arrange along a red line with a positive slope obtained by linear regression. This shows that when the distribution shift from the predicted target distribution (\({\hat{P}}_{\mathrm{{T:tpds}}}\)) to the source distribution (\({P}_{\mathrm{{S}}}\)) is kept small, the shift from \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{T}}}\), i.e., the matching error, is suppressed correspondingly. Furthermore, the two have a positive correlation. Clearly, the results are consistent with the statement of Theorem 1.

5.7 Ablation Study

5.7.1 Effectiveness of Objective Components

This part isolates the effect of the loss components in the objective \(L_{\mathrm{{TPDS}}}\). The ablation experiment is conducted in an incremental way, as shown in Table 9. When both \(L_{\mathrm{{W}}}\) and \(L_{\mathrm{{B}}}\) are unavailable (the first row), the result is obtained by the source model \(\theta _s\) alone. When only \(L_{\mathrm{{W}}}\) is used for adaptation, the adapted model improves by at least 10.0%. When \(L_{\mathrm{{B}}}\) is added, the performance improves further. These comparisons show that both \(L_{\mathrm{{W}}}\) and \(L_{\mathrm{{B}}}\) effectively improve the transfer performance.

Fig. 8

Distribution shift analysis using MMD distance on the Office-Home dataset. Left and Middle: the MMD distance variation to the source distribution and the target distribution, respectively, during training on transfer task Cl\(\rightarrow \)Ar. Right: the further distribution shift analysis on all 12 transfer tasks. A, C, P and R are short for domains Ar, Cl, Pr and Rw, respectively; PR means Pr\(\rightarrow \)Rw. Different from the analysis on Cl\(\rightarrow \)Ar using the intermediate models in the adaptation process, these distances are computed with the final trained model

5.7.2 Effectiveness of Progressive Searching Strategy

In the TPDS paradigm, the progressive search strategy decomposes the large domain shift between the source and target domains into several successive single-step search tasks with small shifts. To verify its effect, we construct a variant, denoted TPDS-w/o-Progressive, without our progressive error control but with a fixed anchor instead. Specifically, through all the epochs, we encourage the current distribution \(P_{\theta _{k}}\) to align to the source distribution \(P_{\theta _{s}}\) rather than to the previous-epoch distribution \(P_{\theta _{k-1}}\). As reported in the first two rows of Table 10, TPDS-w/o-Progressive is still superior to SHOT but clearly inferior to our full model. This indicates the efficacy of the proposed progressive error control strategy.

5.7.3 Effectiveness of Credible Sampling

In our TPDS instantiation, credible sampling is the key procedure for estimating the distribution shift between two adjacent proxy distributions. It involves two technical components: (1) the search space, i.e., the credible group \({\mathcal {G}}\), and (2) the chain-like search on the feature manifold that detects the credible neighbors. To isolate their effectiveness, we build two TPDS variants:

(1) TPDS-w/o-\({\mathcal {G}}\): we extend the search space from the credible group \({\mathcal {G}}\) to the features of all target data. Since the absence of \({\mathcal {G}}\) renders the chain-like search unavailable, we directly select the nearest data point as the credible neighbor based on the cosine distance.

(2) TPDS-w/o-ChainSearch: we directly project the input feature sample to the feature space formed by \({\mathcal {G}}\), without detecting the credible neighbor by the chain-like search.

Table 9 Ablation study results (%) on loss terms with TPDS objective
Table 10 Ablation study results (%)

As reported in Table 10, the performance of the three methods ranks in descending order as TPDS > TPDS-w/o-ChainSearch > TPDS-w/o-\({\mathcal {G}}\) in average accuracy. The results indicate that introducing both the credible group and the manifold hypothesis with the chain-like search boosts the final performance. Also, of note, the implementation difference between TPDS-w/o-ChainSearch and TPDS-w/o-\({\mathcal {G}}\) is whether the credible group \({\mathcal {G}}\) is adopted as the search space. The better results of TPDS-w/o-ChainSearch over TPDS-w/o-\({\mathcal {G}}\) imply that finding credible data is helpful for our unsupervised learning, which is understandable as we absorb the guidance of accurate category information.

5.7.4 Effectiveness of the Cross-Distribution Pairwise Alignment Based on Mutual Information

The cross-distribution pairwise alignment of adjacent proxy distributions is encouraged by mutual information (MI) maximization over the generated data pairs. To evaluate its effectiveness, we construct two comparisons using conventional measures for the alignment: (1) TPDS-w-MMD and (2) TPDS-w-KL, where the mutual information objective is replaced by MMD and KL-divergence, respectively.

From the fifth and sixth rows in Table 10, it is seen that both TPDS-w-MMD and TPDS-w-KL have a large gap to TPDS on all datasets, along with lower performance than the source model on Office-31 and Office-Home. This confirms the critical role of mutual information maximization in our pairwise alignment. In addition, the performance deterioration of the two comparison methods is explainable: MMD and KL-divergence are not pairwise objectives, as their computation is based on the entire set.

5.7.5 Effectiveness of Credible Sample Construction

To evaluate the credible sample construction, we compare our design (\({\mathcal {G}}_{e} \cap {\mathcal {G}}_{o}\)) with using either criterion alone (\({\mathcal {G}}_{e}\) or \({\mathcal {G}}_{o}\)), termed TPDS-w-\({\mathcal {G}}_e\) and TPDS-w-\({\mathcal {G}}_o\). The results in Table 10 (the last row group) show that the two selection criteria are complementary for the performance benefit, validating the efficacy of our design.

6 Conclusion

In this work, we have proposed a new Target Prediction Distribution Searching (TPDS) paradigm for SFDA. Unlike previous methods adopting the conventional feature distribution alignment strategy, TPDS seeks the target prediction distribution with a principled adaptation error mitigation mechanism. Concretely, we construct a flow of proxy prediction distributions and regulate adjacent ones to be only slightly shifted, such that the flow smoothly converges to the target distribution along the geodesic path, on which the overall cumulative error can be elegantly alleviated. The experimental results on five benchmarks show that TPDS achieves state-of-the-art performance under the SFDA setting.