Abstract
Existing Source-Free Domain Adaptation (SFDA) methods typically adopt the feature distribution alignment paradigm by mining auxiliary information (e.g., pseudo-labelling, source domain data generation). However, they are largely limited because the auxiliary information is usually error-prone while effective error-mitigation mechanisms are lacking. To overcome this fundamental limitation, in this paper we propose a novel Target Prediction Distribution Searching (TPDS) paradigm. Theoretically, we prove that given a sufficiently small distribution shift, the domain transfer error can be well bounded. To satisfy this condition, we introduce a flow of proxy distributions that bridges the typically large distribution shift from the source domain to the target domain. This results in a progressive search along the geodesic path, where adjacent proxy distributions are regularized to have a small shift so that the overall error can be minimized. To account for the sequential correlation between proxy distributions, we develop a new pairwise alignment with category consistency algorithm for minimizing the adaptation errors. Specifically, a manifold geometry guided cross-distribution neighbour search is designed to detect the data pairs supporting the Wasserstein distance based shift measurement. Mutual information maximization is then adopted over these pairs for shift regularization. Extensive experiments on five challenging SFDA benchmarks show that our TPDS achieves new state-of-the-art performance. The code and datasets are available at https://github.com/tntek/TPDS.
1 Introduction
Due to the increasing demand for information security and privacy protection, data sharing across domains has become increasingly restricted. Moreover, the conventional unsupervised domain adaptation (UDA) setting has recently been questioned regarding the necessity of access to the source domain (Chidlovskii et al., 2016; Lao et al., 2021; Tanwisuth et al., 2021). In this context, model transfer turns out to be promising (Kim et al., 2021; Li et al., 2020; Liang et al., 2020). This is known as Source-Free Domain Adaptation (SFDA), which aims to adapt a pretrained model to a target scenario in an unsupervised manner without access to source domain training data.
Existing SFDA methods rely on mining auxiliary information to enable the adoption of well-established feature alignment algorithms (Long et al., 2018; Hoffman et al., 2018), following two strategies. The first creates a fake source domain by generative models (Li et al., 2020) or by source hypothesis-based target data splitting (Du et al., 2021), and further aligns the pseudo source data and the target data in feature space as in UDA. The second performs self-supervised learning to transfer the source model to the target domain. In practice, techniques such as pseudo-labels (Liang et al., 2020), source prototypes (Tanwisuth et al., 2021), and target geometric information (Tang et al., 2022) are used to guide the self-supervised learning. Essentially, both strategies perform feature distribution alignment, either explicitly (the first) or implicitly (the second). In Fig. 1a, we illustrate the distribution alignment process. The given source model produces a prediction for the source feature distribution, marked in orange. The feature distribution alignment encourages the embedded target data (marked in green) to move/cluster toward the correct class cluster of the source feature distribution, so that the frozen classifier in the source model can correctly predict categories for the target data. However, aligning the feature distribution for SFDA is challenging in the absence of source domain training data and target domain labels. First, the auxiliary information is error-prone, incurring the risk of error propagation. Furthermore, this limitation is easily amplified in existing SFDA methods due to the lack of error-mitigation mechanisms.
To overcome the aforementioned fundamental limitation, in this work a novel Target Prediction Distribution Searching (TPDS) paradigm is introduced. We reformulate the SFDA problem as searching for the target prediction distribution, in contrast to conventional feature distribution alignment (Fig. 1b). The target prediction distribution is formed as the model output over all the unlabeled training samples from the target domain. The key challenge is how to mitigate the misleading effect caused by the unknown errors of predicted label distributions. To tackle this obstacle, we search for a proxy \({{{\hat{P}}}_{T}}\) under an approximated condition where the source and target domains share the same distribution; that is, the adaptation error needs to be minimized. To achieve this, we introduce a progressive search strategy based on a flow of proxy distributions in which adjacent ones are only slightly shifted (Fig. 1c). As a result, a typically large distribution gap from the source to the target domain can be bridged gradually over multiple stages. Essentially, understanding \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) as two distinct points in a metric space, the search induced by TPDS aims to find an optimal geodesic path in-between with minimal accumulative errors (Fig. 1d). Critically, we prove theoretically that when the distribution shift along this path is sufficiently small, the transfer error across the two domains can be well bounded (Theorem 1).
We further instantiate a TPDS model in deep learning. Concretely, we split the whole training process evenly into multiple stages. Each stage corresponds to a single-step searching driven by aligning two adjacent proxy distributions in this flow. To that end, we design a new algorithm named Pairwise Alignment with Category Consistency (PACon). More specifically, manifold geometry guided credible sampling discovers the potential data pairs (i.e., shift estimation), followed by mutual information maximization based optimization for shift reduction.
The contributions of this work are summarized as follows:
(1) We propose a novel TPDS paradigm for SFDA without high reliance on the accuracy of source domain auxiliary information. Critically, TPDS comes with a theoretical guarantee on adaptation error mitigation, which is largely lacking in previous feature distribution alignment based alternatives.
(2) To mitigate the cross-domain transfer error, we develop a new PACon method to align any two adjacent distributions in a flow. Unlike popular shift measures such as MMD or KL-divergence, PACon encourages pairwise alignment with explicit geometric semantics intrinsic to adjacent distributions.
(3) We evaluate the proposed approach on five challenging domain adaptation benchmarks. The extensive experiments show that our TPDS yields new state-of-the-art results.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 details the proposed paradigm, followed by the model instantiation in Sect. 4. Section 5 presents the experimental results and analyses. Section 6 concludes the paper.
2 Related Work
2.1 Unsupervised Domain Adaptation
For UDA, the key is to reduce the domain drift. Since both the source and target data are accessible during the transfer phase, probability matching has become the main idea for solving this problem. Based on whether deep learning is used, current UDA work can be divided into two categories: (1) deep-learning-based and (2) non-deep-learning-based. In the first category, researchers rely on techniques such as metric learning to reduce domain drift (Long et al., 2015, 2018; Pan et al., 2019). In these methods, an embedding space with a unified probability distribution is learnt by minimizing certain statistical measures that evaluate the discrepancy between the domains, e.g., MMD (maximum mean discrepancy) (Tzeng et al., 2014). In addition, adversarial learning has been another popular framework for its capability of aligning the probabilities of two different distributions (Hoffman et al., 2018; Zhang et al., 2019; Munro & Damen, 2020). The second category reduces the drift in diverse manners. For example, focusing on energy, an energy distribution-based classifier was developed to detect confident target data (Tang et al., 2019). Structural knowledge (Xia et al., 2022), category contrast (Huang et al., 2022) and spectral information (Zhang et al., 2022) have also been exploited to boost the adaptation. In all the aforementioned methods, the source data is indispensable, as labeled samples are used to explicitly formulate domain knowledge (e.g., probability, structure, energy or spectral information). When the labeled data in the source domain are not available, these conventional UDA methods fail.
2.2 Source-Free Domain Adaptation
The current mainstream approach to SFDA follows the paradigm of feature distribution alignment. Existing methods can be generally divided into two classes. The first class performs feature alignment explicitly by converting SFDA to the conventional UDA problem (Li et al., 2020; Du et al., 2021; Tian et al., 2022). These methods reconstruct a fake source domain under some source hypothesis, and further align the target data to the pseudo source data in the feature space. The second class conducts alignment implicitly by adapting the source model to the target domain based on self-supervised learning. In the absence of source domain data, the source model is used to generate an auxiliary factor, such as hard samples (Li et al., 2022) or prototypes (Tanwisuth et al., 2021), to assist feature alignment. On the other hand, alternative methods mine the auxiliary information from the target domain data. Besides widely used pseudo-labels, e.g., clustering-based pseudo-label generation (Liang et al., 2020) and pseudo-label denoising (Chen et al., 2021; Ahmed et al., 2022), geometric information such as the intrinsic neighborhood structure (Yang et al., 2021) and the data manifold (Tang et al., 2022) has also been exploited to guide model adaptation. In contrast to all the previous methods, we introduce a novel target prediction distribution search paradigm conceptually different from feature distribution alignment.
2.3 Gradual Domain Adaptation
In transfer learning, the work most relevant to ours is Gradual Domain Adaptation (GDA), which performs knowledge transfer along the time dimension (e.g., years). In this setting, the shift dynamics are given and represented by a series of intermediate unlabeled domains between the source and target domains. At a high level, gradual self-training (GST) is the main strategy, with two main research lines. The first line extends the GST framework to address a variety of GDA cases, such as the scenario with some intermediate domains missing (Abnar et al., 2021) or without a predefined intermediate domain index (Chen et al., 2021). The second line (Kumar et al., 2020; Wang et al., 2022) focuses on understanding GDA with theoretical analysis. Our work differs significantly from GDA in that no intermediate domains are available, rendering previous GDA methods inapplicable.
2.4 Progressive Transfer in Domain Adaptation
Existing progressive transfer methods for domain adaptation can be split into three groups. The first group is subspace-based, assuming that the source and target domains are two points on a manifold; to reduce the domain gap, subspaces along the geodesic path are interpolated to connect the two points (Gopalan et al., 2011; Caseiro et al., 2015; Cui et al., 2014). The second group is gradual-learning-based (e.g., curriculum learning (Roy et al., 2021), deep clustering (Liang et al., 2021)). In an epoch-wise training fashion, these methods use the previous-epoch model to guide the current training epoch. The third group is domain-generation-based. The core idea is to generate a flow of intermediate, smoothly shifting domains capable of bridging the gap between the source and target domains (Gong et al., 2019). Our method belongs to this group. Importantly, we highlight the key novel designs in comparison: (1) we form the intermediate domains with simple yet reliable probability distributions; (2) we uniquely introduce error control into the progressive learning process; (3) our formulation is tailored for source-free domain adaptation, without the need to access the source data as required in (Gong et al., 2019).
3 Methodology
In this section, we first formulate the SFDA problem, then formalize target prediction distribution searching, and finally present the optimization analysis for a single-step search in the matching process.
3.1 Source Data-Free Domain Adaptation Formulation
Given two different but related domains, i.e., the source domain \(\mathrm{{S}}\) and the target domain \(\mathrm{{T}}\), let \({\mathcal {X}}_s=\{{\varvec{x}_{i}^s\}_{i=1}^{n_s}}\) and \({\mathcal {Y}}_s=\{{y}_{i}^s\}_{i=1}^{n_s}\) be the source samples and the corresponding labels. The target data and their labels are \({\mathcal {X}}_t=\{{\varvec{x}_{i}\}_{i=1}^{n}}\) and \({\mathcal {Y}}_t=\{{y}_{i}\}_{i=1}^{n}\), respectively, where n is the number of target samples. Both domains share the same C-way classification task. In the SFDA setting, given a source model \(\theta _s:{\mathcal {X}}_s \mapsto {\mathcal {Y}}_s\) pre-learned on \(\left( {\mathcal {X}}_s, {\mathcal {Y}}_s\right) \), we intend to learn a target model \(\theta _t: {\mathcal {X}}_t \mapsto {\mathcal {Y}}_t\) through adaptation to the target domain. During the transfer process, only the source model \(\theta _s\) and the unlabeled target data \({\mathcal {X}}_t\) are available.
3.2 Target Prediction Distribution Searching
Unlike conventional feature distribution alignment, we reformulate the SFDA problem as searching for the optimal target prediction distribution. We start with the initial prediction distribution \({P}_{\theta _s}\), obtained by applying the source model \(\theta _s\) to \({\mathcal {X}}_t\). The objective of our TPDS is to identify the ideal prediction distribution \(P_{\mathrm{{T}}}\) (unknown), which typically differs significantly from \({P}_{\theta _s}\) (i.e., a large distribution shift/gap). We formulate this as a distribution optimization problem:
\(\mathop {\min }\limits_{\Theta } \ \mathrm{{D}}\left( {{{\hat{P}}}_{\mathrm{{T}}}},{P_{\mathrm{{T}}}}\right) , \quad {{\hat{P}}}_{\mathrm{{T}}} = \mathrm{{SE}}\left( {P}_{\theta _s}; \Theta \right) \qquad \qquad (1)\)
where \({{\hat{P}}}_{\mathrm{{T}}}\) specifies the estimated target prediction distribution, \(\mathrm{{SE}}(\cdot )\) stands for the search process starting from \({P}_{\theta _s}\), \(\mathrm{{D}}\left( \cdot ,\cdot \right) \) measures the discrepancy between two distributions, and \(\Theta \) refers to the parameters to be learned. Two key challenges arise in this optimization: (1) \(P_{\mathrm{{T}}}\) is unknown, making it hard to optimize directly, and (2) the domain gap between \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) is large, making a one-step search for \({{\hat{P}}}_{\mathrm{{T}}}\) unlikely to achieve a good result.
To overcome both challenges, inspired by the spirit of gradual adaptation (Liang et al., 2020; Abnar et al., 2021), we design a progressive search strategy. Specifically, from the distribution \({P}_{\theta _s}\) to \(P_{\mathrm{{T}}}\), we construct a proxy prediction distribution flow \(P_{\theta _{0}} \rightarrow P_{\theta _{1}} \cdots \rightarrow P_{\theta _{k}} \rightarrow \cdots \rightarrow P_{\theta _{K}}\), where \(P_{\theta _{0}}={P}_{\theta _s}\) with \(\theta _{0}=\theta _{s}\), \(P_{\theta _{K}}={{\hat{P}}}_{\mathrm{{T}}}\) with \(\theta _{K}=\theta _{t}\), and \(\theta _{k}\) represents the k-th intermediate model estimating the proxy distribution \(P_{\theta _{k}}\). Consider a metric space induced by the measure \(\mathrm{{D}}\), in which \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) can be regarded as two distinct points. In this space, the proxy distribution flow connecting \({P}_{\theta _s}\) and \(P_{\mathrm{{T}}}\) specifies the search route. Clearly, there are a number of possible routes, but which one leads to the best matching? Intuitively, the geodesic path is optimal because its accumulative domain shift is minimal. In fact, we can theoretically justify the choice of the geodesic path as follows.
First of all, we select a proper measure to quantify the distribution shift. Of note, the search for \(P_{\theta _k}\), taking \(P_{\theta _{k-1}}\) as the start, has one key property: \(P_{\theta _{k}}\) inherits from \(P_{\theta _{k-1}}\), and its shift from \(P_{\theta _{k-1}}\) is small enough. Namely, we can regard the geometric shape of \(P_{\theta _k}\) as derived from \(P_{\theta _{k-1}}\) by a slight geometric change. In this context, we do not adopt the popular MMD, but instead select a Wasserstein distance as the shift measure \(\mathrm{{D}}\), for two reasons. First, owing to its inherent geometric meaning, the Wasserstein distance is theoretically more suitable than alternatives when two adjacent distributions have a certain geometric relation (Mueller & Jaakkola, 2015). Second, prior work (Shen et al., 2018) has verified that the Wasserstein distance outperforms MMD in domain adaptation. Considering that SFDA is a classification-oriented problem, we use a Wasserstein-infinity distance-based measure (Kumar et al., 2020), denoted by \(D_w(\cdot ,\cdot )\). For any adjacent proxy distributions, the measure of their shift for C-way classification is expressed as
\(\mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) = \mathop {\max }\limits_{c \in \{ 1, \ldots ,C\} } {W_\infty }\left( {P_{{\theta _{k - 1}}}}\left( { X_1 \mid Y_1 = c} \right) ,\ {P_{{\theta _{k}}}}\left( { X_2 \mid Y_2 = c} \right) \right) \qquad \qquad (2)\)
where \({W_\infty }(\cdot ,\cdot )\) is the Wasserstein-infinity distance, the random variables \(X_1\) and \(X_2\) stand for samples drawn from \({P_{{\theta _{k-1}}}}\) and \({P_{{\theta _{k}}}}\) respectively, the random variables \(Y_1\) and \(Y_2\) denote the category, and the conditional distributions \({{P_{{\theta _{k - 1}}}}}\!\left( { X_1 \mid Y_1 \!= \!c} \right) \) and \({{P_{{\theta _{k}}}}}\!\left( { X_2 \mid Y_2 \!= \!c} \right) \) are the probability measures on the c-th category induced by \({{{{\theta _{k - 1}}}}}\) and \({{{{\theta _{k}}}}}\), respectively.
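To build intuition for this measure, note that for two equal-size one-dimensional empirical samples the Wasserstein-infinity distance reduces to the largest gap between sorted order statistics. The sketch below uses this 1-D special case as a toy stand-in for the per-class shift measure; it is an illustration only, not the estimator used later by PACon:

```python
import numpy as np

def w_inf_1d(a, b):
    # For equal-size 1-D empirical samples, the optimal (monotone) coupling
    # matches sorted order statistics, so W_infinity is their largest gap.
    return float(np.max(np.abs(np.sort(a) - np.sort(b))))

def shift_measure(samples_prev, samples_cur):
    # Eq.(2)-style worst-case per-class shift between adjacent proxies.
    # samples_*: dict {class c: 1-D array of samples from that proxy}.
    return max(w_inf_1d(samples_prev[c], samples_cur[c]) for c in samples_prev)

prev = {0: np.array([0.0, 1.0, 2.0]), 1: np.array([5.0, 6.0, 7.0])}
cur  = {0: np.array([0.5, 1.5, 2.5]), 1: np.array([5.0, 6.0, 9.0])}
print(shift_measure(prev, cur))  # class 1 dominates: one sample moved by 2.0
```

Here the class-0 samples all shift by 0.5, but the measure reports 2.0 because the worst class (class 1) dominates, exactly the max-over-categories behaviour of Eq. (2).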
With the measure presented in Eq. (2) and the theoretical results in (Kumar et al., 2020), we derive the following Theorem for the transfer performance upper bound of our progressive searching. The proof is given in Appendix A.
Theorem 1
Suppose the distributions in the proxy distribution flow \(\{ {P_{\theta _{k}}} \}_{k=0}^{K}\) satisfy no label shift (the C categories are fixed) and the data is bounded (\(\Vert \varvec{x}_i\Vert _2^2 \le \rho \), \(\rho >0\) for \( 1 \le i \le n\)). Let the distribution shifts in this flow be \(\Pi = \{ \pi _k \}_{k=1}^{K}\), where \(\pi _k\) changes gradually as k runs from 1 to K, and let \(\pi _m = \max (\Pi )\). If the source model \(\theta _s=\theta _0\) has a low loss \(\alpha _0 \ge \alpha ^*\) on the source domain, then
where \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \) is the objective loss when learning \({\theta _K}\) to predict \({P_{{\theta _K}}}\), R stands for the regularization strength of \({\theta _K}\), \(\alpha ^{*}\) is a given small loss, and n is the size of the target dataset.
Since \({P_{\mathrm{{T}}}}\) in Eq. (1) is unknown, we cannot directly evaluate \(\mathrm{{D}}( {{\hat{P}}}_{\mathrm{{T}}}, {P_{\mathrm{{T}}}})\). We instead analyze the objective loss for predicting \({P_{{\theta _K}}}\) on the target domain, i.e., \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \). When \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) =0\), we arrive at \({{\hat{P}}}_{\mathrm{{T}}}={P_{\mathrm{{T}}}}\).
Remark
Theorem 1 suggests that reducing the maximal distribution shift \(\pi _m\) of the flow lowers the empirical risk of the resulting distribution \({P_{{\theta _K}}}\) on the target domain. In practice, \(\pi _m\) is not determined until the end of the search. To overcome this difficulty, we require all distribution shifts \(\{\pi _k\}_{k=1}^{K}\) of the flow to be sufficiently small, so that minimizing the final empirical risk can be approximated. Under the geometric view in the metric space discussed earlier, this design means that the search should proceed along the geodesic path for the best matching between \({{\hat{P}}}_{\mathrm{{T}}}\) and \({P_{\mathrm{{T}}}}\).
Together with the aforementioned geometry principle, we realize the proposed learning strategy by transforming the original optimization (Eq. (1)) to the following K sub-problems:
\(\mathop {\min }\limits_{\theta _k} \ \mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) , \quad k = 1, \ldots , K \qquad \qquad (4)\)
This defines a search process along the geodesic path, as shown in Fig. 1d. Specifically, the k-th sub-problem refers to a single-step search that computes the current distribution \({P_{\theta _{k}}}\), given the previous distribution \({P_{\theta _{k-1}}}\) formed by model \(\theta _{k-1}\). The entire search process of TPDS yields a proxy distribution flow with sufficiently small shifts in-between.
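As a toy illustration of these sub-problems, the sketch below runs a sequence of single-step searches in which each proxy distribution is constrained to a small total-variation shift from its predecessor. The goal distribution is assumed known here purely for visualization; in TPDS \(P_{\mathrm{{T}}}\) is unknown and each step is instead driven by the unsupervised alignment of Sect. 4:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def single_step(p_prev, p_goal, max_shift=0.05):
    # One sub-problem in the flow: move toward the goal while capping the
    # total-variation shift w.r.t. the previous proxy distribution.
    direction = p_goal - p_prev
    tv = 0.5 * np.abs(direction).sum()
    step = min(1.0, max_shift / (tv + 1e-12))
    return p_prev + step * direction

p0 = softmax(np.array([2.0, 0.0, 0.0]))   # stands in for P_{theta_s}
goal = np.array([0.1, 0.1, 0.8])          # stands in for P_T (known only in this toy)
flow = [p0]
for _ in range(30):                       # K = 30 single-step searches
    flow.append(single_step(flow[-1], goal))
print(np.round(flow[-1], 3))
```

Every element of the flow stays a valid distribution (a convex combination of two distributions), and the shift between adjacent proxies never exceeds the cap, mirroring the sufficiently-small-shift requirement of Theorem 1.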
3.3 Single-Step Searching
As indicated by Eq. (4), a single-step search is driven by aligning the adjacent distributions, namely minimizing the distance \(\mathrm{{D}}_w\left( P_{\theta _{k-1}},P_{\theta _{k}} \right) \). In practice, we do not adopt the original definition in Eq. (2) to estimate this distance, since it requires iteratively computing the Wasserstein-infinity distances of all C categories and taking the maximal one as the distribution shift, whereas in our context no accurate category information is available. To solve this problem, we propose to minimize all the distances \(\{ {{d_0}, \ldots , {d_c}, \ldots {d_{C - 1}}} \}\), without the need for category information. This leads to the following reformulation:
This category-agnostic formula allows us to compute the distribution shift in an unsupervised manner, facilitating the subsequent analysis for shift reduction.
To minimize \(d_c\), we must address two issues: (1) what kind of data supports the estimation of \(d_c\), and (2) how to reduce \(d_c\) based on the found data. For the first issue, according to Eq. (5), the answer is clear: \(d_c\) is associated with data paired by the same category. Given these paired data, for the second issue, since \(d_c\) is a Wasserstein distance built on point-to-point transport, we can perform a pairwise alignment to mimic this point-to-point process. However, this solution is not directly practical for two reasons: (1) we cannot accurately pair data of the same category due to the absence of real target labels in the SFDA setting; and (2) how to encourage pairwise alignment is unclear.
To fix the first problem, we select data with neighbor relations in the feature space as the data pairs, denoted by \((\varvec{x}_i, \varvec{x}'_i) \in ({\mathcal {X}}_t \times {\mathcal {X}}_t)\) for \(1 \le i \le n\). Even without the same-category constraint, the transport-based alignment still works. To explain this, we use Fig. 2 to illustrate the situation of aligning \(P_{c=\mathrm{{O}}}\) to \(Q_{c=\mathrm{{O}}}\) based on data pairs with a neighbor relation. Since their features are close, \(\varvec{x}_i\) and \(\varvec{x}'_i\) have similar but different distributions over the categories. Thus, as shown in Fig. 2a, the data pairs fall into only two kinds: those sharing the same circle category, termed group A (connected with an orange dotted line), and those with two different but related categories (their category distributions overlap), termed group B (connected with a blue dotted line). As shown in Fig. 2b, the transport over group A aligns \(P_{c=\mathrm{{O}}}\) to \(Q_{c=\mathrm{{O}}}\), while the transport over group B broadens the final aligned distribution from \(Q_{c=\mathrm{{O}}}\) (green oval) to \({\widetilde{Q}}_{c=\mathrm{{O}}}\) (blue oval). Clearly, this transport cannot change the structure of the multi-class distribution; it only blurs the category boundary to some extent.
To fix the second problem, considering that data of the same category are close to each other in feature space, we encourage pairwise alignment by introducing a pairwise category consistency constraint on the pairs \(\{(\varvec{x}_i, \varvec{x}'_i)\}_{i=1}^{n}\). Here, the consistency is confined to each single data pair; different pairs may have different categories.
3.4 Overview of Training
In our TPDS paradigm, the proxy distribution flow is supposed to converge to the ideal target distribution \(P_{\mathrm{{T}}}\) progressively. To this end, the adaptation training process is sliced into K successive stages \(\{E_k\}_{k=1}^K\). In \(E_k\), we perform a single-step searching for \(P_{\theta _k}\) w.r.t \(P_{\theta _{k-1}}\) via training the model from \(\theta _{k-1}\) to \(\theta _k\).
4 Model Instantiation
As a showcase of our paradigm, we implement a TPDS instantiation in deep learning. Without loss of generality, we take the search process for \(P_{\theta _k}\) in stage \(E_k\) as a detailed example. Specifically, at the beginning of this stage, the model is initialized with \(\theta _{k-1}\), and a proxy distribution on the target domain is constructed. Next, we search for the optimal \(\theta _k\) in an unsupervised learning manner, achieved by a pairwise alignment with category consistency (PACon) algorithm.
4.1 Model Structure of \(\theta _k\)
During the transfer process, all models, including the source model \(\theta _s\) and all intermediate models \(\{\theta _k\}_{k=1}^{K}\) predicting the proxy distributions, share the same composition. Specifically, \(\theta _k\) consists of a feature extractor \(\phi _k\) and a classifier \(\upsilon _k\) ending with a softmax operation, i.e., \(\theta _k=\upsilon _k \circ \phi _k\) and \(\theta _{k-1}=\upsilon _{k-1} \circ \phi _{k-1}\), where \(\circ \) denotes function composition. In the concrete implementation, we use two neural networks as the two modules: (1) a deep architecture as the feature extractor, and (2) a four-layer network as the classifier. More details are given in the Implementation Details part of the Experiments section.
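A minimal NumPy sketch of this composition is given below; the linear layers and their sizes are placeholders for the deep backbone and the classifier head, chosen only to show the \(\upsilon _k \circ \phi _k\) structure rather than the actual architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ThetaK:
    """theta_k = upsilon_k o phi_k: feature extractor + softmax-ended classifier.
    Single linear layers stand in for the deep backbone and the 4-layer head."""
    def __init__(self, d_in, d_feat, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W_phi = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)
        self.W_ups = rng.normal(size=(d_feat, num_classes)) / np.sqrt(d_feat)

    def phi(self, x):                       # feature extractor phi_k
        return np.maximum(x @ self.W_phi, 0.0)

    def __call__(self, x):                  # upsilon_k(phi_k(x))
        return softmax(self.phi(x) @ self.W_ups)

model = ThetaK(d_in=16, d_feat=8, num_classes=3)
probs = model(np.random.default_rng(1).normal(size=(4, 16)))
print(probs.shape)  # each of the 4 rows is a C-way prediction distribution
```

Applying such a model to all of \({\mathcal {X}}_t\) yields exactly the prediction distribution \(P_{\theta _k}\) searched over in the flow.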
4.2 Overview of PACon
Corresponding to the insight from the previous section, our PACon algorithm has two successive components, as shown in Fig. 3: (I) distribution shift estimation and (II) distribution shift reduction. The first component is based on a credible sampling method for generating data pairs. Specifically, at the beginning of each stage \(E_k\), all target samples are first embedded by the previous-stage model \(\theta _{k-1}\) to form a feature space for search. During \(E_k\), given an input data batch, we extract its features using the up-to-date model \(\theta _{k}\) and identify the paired samples in the search space with a chain-like search process (Fig. 4). The second component then aligns \(P_{\theta _k}\) to \(P_{\theta _{k-1}}\) by maximizing the mutual information over those data pairs.
4.3 Distribution Shift Estimation
As shown in the analysis above, the distribution shift estimation depends on data pairs with a neighbor relation. To account for the fact that deep features lie on a data manifold, our credible sampling for data pair detection builds the neighbor relation on the feature manifold by a chain-like search. Furthermore, since the categories obtained in an unsupervised way are noisy, the sampled data forming the pairs are required to be credible, and are termed credible neighbors.
Suppose the data pairs constructed by credible sampling are \(\{(\varvec{x}_i, \varvec{x}'_i)\}_{i=1}^{n} \subset {\mathcal {X}}_t \times {\mathcal {X}}_t\), where \(\varvec{x}_i\) (from \({P_{{\theta _{k}}}}\)) and \(\varvec{x}'_i\) (from \({P_{{\theta _{k-1}}}}\)) are the input instance and its credible neighbor, respectively. \(\varvec{x}_i\) and all target data \({\mathcal {X}}_t\) are mapped into the feature space by models \(\theta _{k}\) and \(\theta _{k-1}\), respectively. Concretely, the feature extractor \(\phi _{k}\) (in \(\theta _{k}\)) transforms \(\varvec{x}_i\) to the feature \(\varvec{z}_i^{\star }\), while the feature extractor \(\phi _{k-1}\) (in \(\theta _{k-1}\)) maps \({\mathcal {X}}_t\) into the features \({\mathcal {Z}}=\{\varvec{z}_i\}_{i=1}^{n}\), where \(\varvec{z}_i=\phi _{k-1}(\varvec{x}_i)\), which form a data manifold. The classifier \(\upsilon _{k-1}\) then converts \({\mathcal {Z}}\) to probability vectors \({\mathcal {P}}=\{\varvec{p}_i \}_{i=1}^{n}\), where \(\varvec{p}_i=\upsilon _{k-1}(\varvec{z}_i)\). The data pair construction proceeds in the following two steps.
Step A: Credible group construction. Firstly, we generate a group \({\mathcal {G}}_e\) using popular entropy-based ranking over \({\mathcal {Z}}\), as in (Liu et al., 2021; Yang et al., 2020). Computing the entropy of each prediction converts \({\mathcal {P}}\) to the entropy set \({\mathcal {H}}=\{{h}_i \}_{i=1}^{n}\), where \({h}_i=-\sum {{\varvec{p}}}_{i}\log {{\varvec{p}}}_{i}\). Thus, \({\mathcal {G}}_{e}\) can be obtained by
\({\mathcal {G}}_{e} = \mathrm{{topk}}\left( {\mathcal {H}}, \lceil \sigma _e n \rceil \right) \qquad \qquad (6)\)
where \(\sigma _e\) is a scaling factor and \(\mathrm{{topk}}(\cdot ,\cdot )\) is the lowest-value selection operator defined below.
We consider this entropy-based strategy (i.e., \({\mathcal {G}}_e\)) limited in the sense that there exists a many-to-one projection between prediction distributions and entropy values, leading to ambiguous selection. To mitigate this problem, we introduce another selection criterion based on the class-aware feature geometric structure, with a particular stress on the most likely class prediction. Although this criterion is correlated with the entropy measurement (the higher the probability the most likely class receives, the lower the entropy of the prediction distribution), it supplies geometric information beyond what entropy alone captures.
Specifically, to further enhance credibility, we split off another group, \({\mathcal {G}}_{o}\), by clustering-based ranking. We obtain C cluster centers \(\{\varvec{o}_c\}_{c=1}^{C}\) by a weighted k-means method formulated as
\(\varvec{o}_c = \frac{\sum \nolimits _{i=1}^{n} p_{i,c}\, \varvec{z}_i}{\sum \nolimits _{i=1}^{n} p_{i,c}} \qquad \qquad (7)\)
where \(p_{i,c}\) is the c-th element of the vector \(\varvec{p}_i\).
Thus, the data credibility can be expressed by the minimum distance of a sample from the C cluster centers. Suppose the distances of \(\varvec{z}_i\) from \(\{\varvec{o}_c\}_{c=1}^{C}\) form the vector \(\varvec{b}_i \in {\mathbb {R}}^{C}\), whose c-th element, standing for the distance from the c-th cluster center, equals \(D_{cos}\left( \varvec{z}_i, \varvec{o}_c\right) \), where \(D_{cos}\) denotes the cosine distance. Let \(a_i=\mathrm{{min}}(\varvec{b}_i)\) be \(\varvec{x}_i\)'s minimum distance from the C centers, so that we obtain a measure set \({\mathcal {A}}=\{{a}_i \}_{i=1}^{n}\) over the target data. Thus, we can obtain \({\mathcal {G}}_o\) by
\({\mathcal {G}}_{o} = \mathrm{{topk}}\left( {\mathcal {A}}, \lceil \sigma _o n \rceil \right) \qquad \qquad (8)\)
where \(\sigma _o\) is a scaling factor, and \(\mathrm{{topk}}\left( {\mathcal {X}}, k\right) \) selects the k lowest elements from set \({\mathcal {X}}\) and returns their indexes.
Finally, we get the credible data group \({\mathcal {G}}\) by an intersection operation as \({\mathcal {G}} = {\mathcal {G}}_o \cap {\mathcal {G}}_e\). In Fig. 4, these feature data belonging to \({\mathcal {G}}\) are marked in blue.
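The two-step group construction can be sketched as follows. The weighted-centroid formula, the cosine-distance ranking, and the \(\sigma \) values here are our illustrative assumptions following the common weighted-centroid reading of the text, not the exact released implementation:

```python
import numpy as np

def credible_group(Z, P, sigma_e=0.5, sigma_o=0.5):
    # Z: (n, d) features from phi_{k-1}; P: (n, C) softmax predictions.
    n = len(Z)
    # G_e: indices of the lowest-entropy predictions (Step A, entropy ranking).
    H = -np.sum(P * np.log(P + 1e-12), axis=1)
    G_e = set(np.argsort(H)[: int(sigma_e * n)])
    # Weighted centroids o_c = sum_i p_ic z_i / sum_i p_ic (illustrative).
    O = (P.T @ Z) / P.sum(axis=0, keepdims=True).T            # (C, d)
    # a_i: cosine distance of each feature to its nearest centroid.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    On = O / np.linalg.norm(O, axis=1, keepdims=True)
    A = (1.0 - Zn @ On.T).min(axis=1)
    G_o = set(np.argsort(A)[: int(sigma_o * n)])
    return G_e & G_o                                          # G = G_o  intersect G_e

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
logits = rng.normal(size=(8, 3))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
G = credible_group(Z, P)
print(sorted(G))
```

The intersection keeps only samples that are credible under both criteria, so \({\mathcal {G}}\) is never larger than either individual group.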
Step B: Data pair generation. To construct the data pairs, we find the credible neighbor \(\varvec{x}'_i\) from \({\mathcal {G}}\) by a chain-like search, as illustrated in Fig. 4. Starting with the feature of the input instance, \(\varvec{z}_i^{\star }\) (the red triangle), we carry out a one-step search for its nearest neighbor, denoted by \(\hat{\varvec{z}}_1\), based on the cosine distance. If \(\hat{\varvec{z}}_1\) does not belong to \({\mathcal {G}}\), we conduct another one-step search for the next nearest neighbor, \(\hat{\varvec{z}}_2\), taking \(\hat{\varvec{z}}_1\) as the start. Repeating this process, we build a search flow \(\{\varvec{z}_i^{\star }, \hat{\varvec{z}}_1, \hat{\varvec{z}}_2, \dots , {\varvec{z}'}_i\}\) reaching \({\mathcal {G}}\). In this flow, the end element \({\varvec{z}'}_i = \phi _{k-1}(\varvec{x}'_i)\) (the yellow triangle) is the feature of the credible neighbor we seek, whilst the other elements (marked in light blue), except \(\varvec{z}_i^{\star }\), are intermediate features. Note that to avoid cyclic paths in the flow, the search excludes previously visited points: when searching for \(\hat{\varvec{z}}_k\), the historical elements \(\varvec{z}_i^{\star }\) and \(\{\hat{\varvec{z}}_j\}_{j=1}^{k-1}\) are excluded from the similarity comparison.
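A compact sketch of this chain-like search is given below; the visited set implements the history exclusion, and the tiny 2-D features are purely illustrative:

```python
import numpy as np

def chain_search(z_star, Z, G):
    """Hop along nearest neighbours (by cosine similarity) starting from the
    query feature z_star until an index in the credible group G is reached.
    Visited indices are excluded to avoid cyclic paths. Assumes G is non-empty
    and reachable; Z rows play the role of the phi_{k-1} feature bank."""
    norms = np.linalg.norm(Z, axis=1)
    visited = set()
    cur = z_star
    for _ in range(len(Z)):                  # at most n hops
        s = (Z @ cur) / (norms * np.linalg.norm(cur) + 1e-12)
        for i in visited:
            s[i] = -np.inf                   # history exclusion
        j = int(np.argmax(s))
        if j in G:
            return j                         # index of credible neighbour x'_i
        visited.add(j)
        cur = Z[j]
    raise RuntimeError("no credible neighbour reachable")

Z = np.array([[0.99, 0.10],
              [0.90, 0.30],
              [0.50, 0.80]])
print(chain_search(np.array([1.0, 0.0]), Z, G={2}))  # hops 0 -> 1 -> 2
```

With \({\mathcal {G}}=\{2\}\), the query first reaches index 0, then 1, and finally the credible index 2, mirroring the flow \(\{\varvec{z}_i^{\star }, \hat{\varvec{z}}_1, \hat{\varvec{z}}_2, \dots , {\varvec{z}'}_i\}\) of Fig. 4.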
4.4 Distribution Shift Reduction
In our instantiation, the distribution shift is reduced by pairwise alignment of the detected data pairs under the category consistency constraint. Inspired by the theoretical connection between mutual information and pairwise losses (Boudiaf et al., 2020), we use the following objective to reach this goal, where \(I(\cdot ,\cdot )\) is the mutual information function (Paninski, 2003), computed in the same way as in Ji et al. (2019).
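Since the mutual information is computed as in Ji et al. (2019), the pairwise term can be estimated from the joint distribution of paired soft predictions. The function below is an illustrative sketch of that IIC-style computation; the function name and the `eps` smoothing are our assumptions:

```python
import torch

def pairwise_mutual_information(q, q_pair, eps=1e-8):
    """I(Q, Q') estimated from paired soft predictions, IIC-style (Ji et al., 2019).

    q, q_pair: (n, C) softmax outputs of each sample and its credible neighbour.
    """
    P = (q.T @ q_pair) / q.size(0)       # empirical joint distribution over class pairs
    P = (P + P.T) / 2                    # symmetrize
    P = P.clamp(min=eps)
    P = P / P.sum()                      # renormalize after clamping
    Pi = P.sum(dim=1, keepdim=True)      # marginal of Q
    Pj = P.sum(dim=0, keepdim=True)      # marginal of Q'
    return (P * (P.log() - Pi.log() - Pj.log())).sum()
```

Maximizing this quantity over the generated pairs encourages consistent category predictions across the two adjacent proxy distributions.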
Of note, \({L}_W\) is sensitive to the target dataset scale. When the scale is small, the limited data cannot describe the probability distribution well, so optimization based on the single regularizer \({L}_W\) is restricted. To overcome this problem, we introduce a diversity loss encouraging balanced category predictions. This technique is widely adopted by unsupervised multi-way classification approaches (Jabi et al., 2019) to avoid solution collapse (Ghasedi Dizaji et al., 2017), in which the model predicts all data as a few specific categories. Supposing that \(\theta _k\) transforms \(\varvec{x}_i\) into a probability vector \(\varvec{q}_i\), this loss is expressed as
where \(\mathrm{{KL}}(\cdot ||\cdot )\) denotes the \(\mathrm{{KL}}\)-divergence; \({\varrho }_{\{c=1, \cdots , C\}} = \frac{1}{C}\) is the uniform distribution; \(\bar{q}_c = \frac{1}{n} \sum _{i=1}^{n}{q}_{i,c}\) is the empirical label distribution, in which \({q}_{i,c}\) is the probability of \(\varvec{x}_i\) belonging to the c-th category. Combining with Eq. (9), we have the final objective:
where \(\beta _n\) trades off the two regularizations; its value is related to the dataset scale (represented by the target data number n): the smaller the dataset, the larger the \(\beta _n\) value (its rationality is verified in the experiments). For clarity, we also summarize the overall training of TPDS in Alg. 1.
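A minimal sketch of the diversity regularizer, assuming the loss takes the form \(\mathrm{KL}(\varrho \,||\, \bar{q})\) with \(\varrho\) the uniform distribution (the function name and the clamping constant are illustrative):

```python
import torch

def diversity_loss(q):
    """KL(uniform || mean prediction): penalizes collapsed category predictions.

    q: (n, C) probability vectors q_i produced by the model theta_k.
    """
    C = q.size(1)
    q_bar = q.mean(dim=0).clamp(min=1e-8)     # empirical label distribution q_bar
    rho = torch.full_like(q_bar, 1.0 / C)     # uniform target distribution rho
    return (rho * (rho.log() - q_bar.log())).sum()
```

In the final objective this term would be weighted by \(\beta _n\) and added to the mutual-information-based shift regularizer.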
5 Experiments and Analyses
5.1 Data Sets
In this paper, we evaluate our method on five widely used benchmarks as follows.
Digits (Hoffman et al., 2018). A typical benchmark in UDA, from which we use the three most frequently used subsets, i.e., SVHN (S), MNIST (M), and USPS (U). They contain images of the digits 0 to 9 captured in different environments. We evaluate on three relatively challenging cross-domain tasks, i.e., S\(\rightarrow \)M, U\(\rightarrow \)M, and M\(\rightarrow \)U.
Office-31 (Saenko et al., 2010). Office-31 is a small-scale dataset widely used in domain adaptation, including three domains, i.e., Amazon (A), Webcam (W), and Dslr (D), all containing images of real-world objects in various office environments. The dataset has 4,652 images of 31 categories in total. Images in A are online e-commerce pictures, while W and D consist of low-resolution and high-resolution pictures, respectively.
Office-Home (Venkateswara et al., 2017). Office-Home is a medium-scale dataset mainly used for domain adaptation, containing 15k images belonging to 65 categories from work or home environments. The dataset has four distinct domains, i.e., Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-world images (Rw).
VisDA (Peng et al., 2017). VisDA is a challenging large-scale dataset for synthetic-to-real transfer recognition over 12 categories. The source domain contains 152k synthetic images, while the target domain has 55k real object images from Microsoft COCO.
PACS (Li et al., 2017). PACS is an image dataset for domain generalization. It consists of four subdomains with 9.9k images sharing seven categories. The domains are Photo (P), Art Painting (A), Cartoon (C) and Sketch (S).
5.2 Implementation Details
5.2.1 Neural Network Architecture
We design and implement our network architecture in PyTorch. The above datasets can be divided into two types of task: digit recognition and object recognition. For the digit recognition task, we use a LeNet variant as the feature extraction module, as done in Liang et al. (2020). For the object recognition task, following standard practice for fair comparison, each model \(\theta _k\) comprises a feature extractor \(\phi _k\) and a classifier \(\upsilon _k\). The feature extractor \(\phi _k\) contains a heavyweight deep architecture, a batch-normalization layer and a fully-connected layer of size 2048\(\times \)256. As done in Liang et al. (2020), Yang et al. (2021), Tang et al. (2022), for the deep architecture we adopt ImageNet-pretrained ResNet50 (He et al., 2016) on Office-31, Office-Home and PACS, and ResNet101 (He et al., 2016) on VisDA. For all datasets, the classifier \(\upsilon _k\) takes the same structure as initially used in (Liang et al., 2020; Yang et al., 2020, 2021; Tang et al., 2022): the input layer is a fully-connected layer with batch normalization, and the output layer is a fully-connected layer with weight normalization.
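The architecture described above can be sketched as follows. This is a schematic reconstruction in PyTorch under the common SHOT-style layout, not the authors' released code; the class names and the pluggable `backbone` argument are our assumptions:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """phi_k: deep backbone -> fully-connected bottleneck (2048 -> 256) -> batch norm.

    `backbone` is any module mapping images to 2048-d vectors; in the paper it is
    an ImageNet-pretrained ResNet50 (ResNet101 on VisDA).
    """
    def __init__(self, backbone, backbone_dim=2048, feat_dim=256):
        super().__init__()
        self.backbone = backbone
        self.bottleneck = nn.Linear(backbone_dim, feat_dim)
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, x):
        return self.bn(self.bottleneck(self.backbone(x)))

class Classifier(nn.Module):
    """upsilon_k: weight-normalized fully-connected output layer."""
    def __init__(self, feat_dim=256, num_classes=31):
        super().__init__()
        self.fc = nn.utils.weight_norm(nn.Linear(feat_dim, num_classes))

    def forward(self, z):
        return self.fc(z)
```

A model \(\theta _k\) is then the composition of the two modules, e.g. `logits = Classifier()(FeatureExtractor(backbone)(x))`.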
5.2.2 Source Model
\(\theta _s\) training. For all evaluation datasets, \(\theta _s\) was pretrained under the standard protocol (Liang et al., 2020; Tang et al., 2021; Yang et al., 2021). The adopted training objective is given in Appendix B. We split the labelled source data into two parts of 90%:10% for model pretraining and validation. We set the training epochs on Digits, Office-31, Office-Home, PACS and VisDA to 30, 100, 50, 50 and 10, respectively.
5.2.3 TPDS Training
We adopt the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.001. The learning rate is set to 0.01 for Office-31, Office-Home and PACS, and 0.001 for VisDA. We train for 15 epochs with a batch size of 64 on each target domain. TPDS has three hyperparameters: we set the scaling factors \((\sigma _o, \sigma _e)=(0.6,0.5)\) for the credible group construction on all target domains, whilst \(\beta _n=1/0.5/0\) for the small dataset (Office-31), medium datasets (Office-Home and PACS) and large dataset (VisDA), according to the dataset scale.
5.3 Competitors
To verify the effectiveness of our method, we select 34 comparison methods, which can be divided into the following two groups according to whether they access the source data during the transfer phase.
(1) 19 state-of-the-art vanilla domain adaptation methods, all requiring source and target data simultaneously to address the domain shift: ADDA (Tzeng et al., 2017), ADR (Saito et al., 2018), CDAN (Long et al., 2018), CyCADA (Hoffman et al., 2018), SWD (Lee et al., 2019), CAT (Deng et al., 2019), BSP (Chen et al., 2019), TN (Wang et al., 2019), SAFN (Xu et al., 2019), IA (Jiang et al., 2020), DMRL (Wu et al., 2020), STAR (Lu et al., 2020), MCC (Jin et al., 2020), CGDM (Du et al., 2021), TCM (Yue et al., 2021), SRDC (Tang et al., 2020), SUDA (Zhang et al., 2022), CaCo (Huang et al., 2022) and MSGD (Xia et al., 2022).
(2) 15 current state-of-the-art SFDA models: SFDA (Kim et al., 2021), 3C-GAN (Li et al., 2020), SHOT (Liang et al., 2020), BAIT (Yang et al., 2020), HMI (Lao et al., 2021), PCT (Tanwisuth et al., 2021), CPGA (Qiu et al., 2021), AAA (Li et al., 2022), PS (Du et al., 2021), GKD (Tang et al., 2021), A2Net (Xia et al., 2021), NRC (Yang et al., 2021), VDM (Tian et al., 2022), NEL (Ahmed et al., 2022) and U-SFAN+ (Roy et al., 2022). Among them, 3C-GAN, PS, AAA, A2Net and VDM are based on pseudo source domain generation or construction, whilst the remaining methods are based on the framework of self-supervised learning.
5.4 Comparative Results
5.4.1 Digit Recognition
As reported in Table 1, TPDS obtains the best results on all tasks compared with SHOT, with a 0.3% increase in average accuracy. Compared with the UDA methods, TPDS achieves the highest performance on 2 out of 3 tasks (all except U\(\rightarrow \)M), surpassing the best method SWD by 0.4% in average accuracy.
5.4.2 Object Recognition
Tables 2, 3, 4 and 5 present the quantitative results on the four datasets. On Office-31 (Table 2), TPDS obtains the best results among the SFDA methods on two transfer tasks, A\(\rightarrow \)D and A\(\rightarrow \)W, leading to a 90.2% average accuracy, 0.2% higher than the second-best method A2Net (90.0%). On Office-Home (Table 3), TPDS outperforms the other methods on 5 out of 12 tasks. Compared with the previous best method A2Net (72.8%), TPDS improves the average accuracy by 0.7%, reaching 73.5%. On VisDA (Table 4), TPDS achieves the best performance on all categories except two, person and truck. In average accuracy, TPDS obtains 87.6%, surpassing the previous best method NRC by 1.7%. On PACS (Table 5), TPDS obtains the best results on all transfer tasks except A\(\rightarrow \)P and C\(\rightarrow \)S, where the gap is small. In particular, TPDS improves by 20.8% and 15.1% on tasks A\(\rightarrow \)S and P\(\rightarrow \)S respectively over the second-best method GKD. As a result, TPDS beats SHOT by a margin of 3.7% in average accuracy.
Besides, compared with the conventional UDA methods that require access to the source data, TPDS is also competitive on the three object recognition datasets, as shown in Tables 2, 3 and 4, despite having no source data to draw on during the adaptation phase. Specifically, on Office-31, TPDS trails the best UDA method MSGD by 0.6%. However, as the amount of target data increases, the advantage of TPDS grows, surpassing MSGD by 1.1% and 3.0% on Office-Home and VisDA, respectively. To sum up, the comparison results on the five datasets confirm the state-of-the-art performance of TPDS.
5.5 Further Analyses
5.5.1 Feature Visualization
Using the widely used visualization tool t-SNE (Van der Maaten & Hinton, 2008), we conduct a feature visualization experiment based on the 65-way classification results of transfer task Cl\(\rightarrow \)Ar in the Office-Home dataset. Figure 5 presents the visualization results. As shown at the top, before adaptation, the intertwined features embedded by the source model \(\theta _s\) are distributed without apparent aggregation (the first sub-figure); after adaptation, the features aggregate evidently (the last sub-figure). From left to right, the three sub-figures show that the features gradually cluster during the adaptation phase. For clearer observation, we select the first 30 of the 65 categories to present the clustering details. As shown at the bottom, where different colors stand for different categories, the aggregation is category-aware. The visualization results show that TPDS can predict a probability distribution with category meaning.
5.5.2 Hyperparameter Sensitivity
To validate the sensitivity of \(\sigma _o\) and \(\sigma _e\), we conduct 30 experiments with \(\sigma _o \in [0.3, 0.8]\) and \(\sigma _e \in [0.3, 0.7]\) on task Cl\(\rightarrow \)Ar of the Office-Home dataset. As shown on the left of Fig. 6, the accuracy does not change drastically; thus, TPDS's performance is robust to \(\sigma _o\) and \(\sigma _e\). As for the sensitivity of \(\beta _n\), we conduct 6 experiments with \(\beta _n\) varying from 0 to 1.0 on each testing dataset. As shown in the middle of Fig. 6, as the dataset size increases (Office-31 \(\rightarrow \) Office-Home \(\rightarrow \) VisDA), the best result occurs at \(\beta _n=1.0/0.6/0.0\), respectively. This phenomenon is consistent with our expectation: as mentioned above, when the amount of data is not enough to describe the distribution, \(L_{\mathrm{{B}}}\) can boost the \(L_{\mathrm{{W}}}\)-only-based transfer; conversely, it deteriorates the performance. These results confirm the rationality of relating \(\beta _n\) to the dataset size.
5.5.3 Training Stability
On the right of Fig. 6, we present the training stability of TPDS on task Cl\(\rightarrow \)Ar in the Office-Home dataset. As training proceeds from epoch 1 to 15, the accuracy climbs rapidly in the early epochs (1–4) and converges to the maximum through a slow increase with small fluctuations (epochs 4–13). The loss value of \(L_{\mathrm{{TPDS}}}\) decreases gradually over all epochs, a trend consistent with the performance change. This indicates that the training of TPDS is stable and reliable.
5.5.4 Extensibility
In the spirit of SHOT++ (Liang et al., 2021), which leverages additional training components (e.g., MixMatch for data augmentation (Berthelot et al., 2019)) on top of SHOT, we carry out a test of model extensibility where our TPDS is further enhanced by MixMatch, termed TPDS+MixMatch. As reported in Table 6, TPDS+MixMatch obtains better results than the original version, which confirms the extensibility of our TPDS.
5.5.5 Computational Cost
We compare our method with SHOT in terms of the average training time. The results in Table 7 show that our model training is somewhat slower due to its gradual adaptation nature, a reasonable cost for better adaptation performance.
5.5.6 Convergence of Chain-like Search
We evaluate the convergence of our chain-like search. We first track the average number of search steps during the training phase. As shown in Table 8, our model takes fewer than two steps on average to reach another sample in the credible group. The statistics reported in Fig. 7 further indicate that our search process converges well across the training epochs.
5.6 Distribution Shift Analysis
TPDS is a probability distribution alignment based scheme. This sub-section gives a distribution shift analysis using the MMD distance (ZongxianLee, 2019) to verify whether TPDS reduces the matching error via progressive alignment. In this experiment, we feed the target data \({\mathcal {X}}_t\) through the source model \(\theta _s\) and take the outputs as an empirical estimation of the source distribution, i.e., \({P}_{\mathrm{{S}}}=\theta _s({\mathcal {X}}_t)\). Besides, we train an ideal target model \(\tilde{\theta }_t\) over \({\mathcal {X}}_t\) with labels, in the same way as the source model training (Appendix B), and use its outputs to represent the ideal target domain, i.e., \({P}_{\mathrm{{T}}}=\tilde{\theta }_t({\mathcal {X}}_t)\). For comparison, we select SHOT as the baseline owing to its identical epoch-wise learning strategy, albeit without error mitigation. The predicted target distributions obtained by TPDS and SHOT are denoted by \({\hat{P}}_{\mathrm{{T:tpds}}}=\theta _{tpds}({\mathcal {X}}_{t})\) and \({\hat{P}}_{\mathrm{{T:shot}}}=\theta _{shot}({\mathcal {X}}_{t})\), respectively.
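For reference, a biased squared-MMD estimate with a Gaussian kernel, of the kind such an analysis typically uses, can be sketched as follows (the function name and bandwidth handling are illustrative, not tied to the cited implementation):

```python
import torch

def mmd(X, Y, sigma=1.0):
    """Biased squared MMD between two sample sets under a Gaussian (RBF) kernel.

    X, Y: (n, d) and (m, d) samples from the two distributions being compared,
    e.g. the model outputs standing in for P_S, P_T, and the predicted targets.
    """
    def k(A, B):
        d2 = torch.cdist(A, B) ** 2              # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2)) # Gaussian kernel values
    # MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical sample sets give exactly zero, and the value grows with the shift between the two distributions, which is the behaviour the analysis in Fig. 8 relies on.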
The first two sub-figures in Fig. 8 display the variation of the MMD distance to \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) during training, respectively. Both \({\hat{P}}_{\mathrm{{T:tpds}}}\) and \({\hat{P}}_{\mathrm{{T:shot}}}\) move away from \({P}_{\mathrm{{S}}}\), but the distance of \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{S}}}\) remains consistently smaller than that of \({\hat{P}}_{\mathrm{{T:shot}}}\) (the left). This indicates that our progressive alignment is effective in limiting the distance to \({P}_{\mathrm{{S}}}\), so the accurate guidance provided by \({P}_{\mathrm{{S}}}\) is better preserved. Correspondingly, \({\hat{P}}_{\mathrm{{T:tpds}}}\) and \({\hat{P}}_{\mathrm{{T:shot}}}\) approach \({P}_{\mathrm{{T}}}\), with an interesting observation (the middle): SHOT's distance decreases rapidly in the early epochs (1 to 5), followed by a gradual increase after epoch 6. In contrast, TPDS's distance presents a declining trend throughout all epochs. This phenomenon is understandable: in the late stages, the errors in pseudo-labels propagate further under the adaptation regulated by SHOT, whereas TPDS can control this propagation thanks to the introduced adaptation error mitigation mechanism.
According to Theorem 1, a progressive alignment on the proxy distribution flow encourages a matching error minimization to the ideal target distribution \({P}_{\mathrm{{T}}}\). To verify this conclusion, we perform a further distribution shift analysis on all 12 transfer tasks in the Office-Home dataset. Unlike the analysis above, we do not use the intermediate models, instead using the final model after 15 training epochs. Besides, the horizontal and vertical coordinates are changed to the distances to \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\), respectively, to uncover the relation between them. To account for the different cross-domain shifts of these tasks, we normalize the MMD distances for a clear view. Specifically, the distances of \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{S}}}\) and \({P}_{\mathrm{{T}}}\) are normalized by the corresponding distances of \({\hat{P}}_{\mathrm{{T:shot}}}\). As shown on the right of Fig. 8, all task points fall in the \([0,1] \times [0,1]\) area and lie along a red line with positive slope obtained by linear regression. This shows that when the distribution shift from the predicted target distribution (\({\hat{P}}_{\mathrm{{T:tpds}}}\)) to the source distribution (\({P}_{\mathrm{{S}}}\)) is kept small, the shift from \({\hat{P}}_{\mathrm{{T:tpds}}}\) to \({P}_{\mathrm{{T}}}\), i.e., the matching error, is correspondingly suppressed; moreover, the two are positively correlated. Clearly, the results are consistent with Theorem 1.
5.7 Ablation Study
5.7.1 Effectiveness of Objective Components
This part isolates the effect of the loss components in objective \(L_{\mathrm{{TPDS}}}\). The ablation experiment is conducted incrementally, as shown in Table 9. When both \(L_{\mathrm{{W}}}\) and \(L_{\mathrm{{B}}}\) are unavailable (the first row), the result is that of the source model \(\theta _s\) alone. When only \(L_{\mathrm{{W}}}\) is used for adaptation, the adapted model improves by at least 10.0%. When \(L_{\mathrm{{B}}}\) is added, the performance improves further. These comparisons show that both \(L_{\mathrm{{W}}}\) and \(L_{\mathrm{{B}}}\) effectively improve transfer performance.
5.7.2 Effectiveness of Progressive Searching Strategy
In the TPDS paradigm, the progressive searching strategy decomposes the large domain shift between the source and target domains into several successive single-step searching tasks with small shifts. To verify its effect, we build a variant, denoted TPDS-w/o-Progressive, that replaces our progressive error control with a fixed reference. Specifically, throughout all epochs, we align the current distribution \(P_{\theta _{k}}\) to the source distribution \(P_{\theta _{s}}\) rather than to the previous epoch's \(P_{\theta _{k-1}}\). As reported in the first two rows of Table 10, TPDS-w/o-Progressive is still superior to SHOT but clearly inferior to our full model. This indicates the efficacy of the proposed progressive error control strategy.
5.7.3 Effectiveness of Credible Sampling
In our TPDS instantiation, credible sampling is the key procedure for estimating the distribution shift between two adjacent proxy distributions. It involves two technical components: (1) the search space, i.e., the credible group \({\mathcal {G}}\); (2) the chain-like search on a feature manifold that detects the credible neighbors. To isolate their effectiveness, we build two TPDS variants:
(1) TPDS-w/o-\({\mathcal {G}}\): We extend the search space to the features of all target data instead of the credible group \({\mathcal {G}}\). Since the absence of \({\mathcal {G}}\) renders the chain-like search unavailable, we directly select the nearest data point as the credible neighbor based on cosine distance.
(2) TPDS-w/o-ChainSearch: We directly project the input feature onto the feature space formed by \({\mathcal {G}}\), without detecting the credible neighbor via the chain-like search.
As reported in Table 10, the three methods rank in descending order of average accuracy as TPDS > TPDS-w/o-ChainSearch > TPDS-w/o-\({\mathcal {G}}\). The results indicate that introducing both the credible group and the manifold hypothesis with chain-like search boosts the final performance. Also, note that the implementation difference between TPDS-w/o-ChainSearch and TPDS-w/o-\({\mathcal {G}}\) is whether the credible group \({\mathcal {G}}\) is adopted as the search space. The better results of TPDS-w/o-ChainSearch imply that finding credible data is helpful for our unsupervised learning, which is understandable since it absorbs the guidance of accurate category information.
5.7.4 Effectiveness of the Cross-Distribution Pairwise Alignment Based on Mutual Information
The cross-distribution pairwise alignment of adjacent proxy distributions is encouraged by mutual information (MI) maximization over the generated data pairs. To evaluate its effectiveness, we build two comparisons using conventional measures for the alignment: (1) TPDS-w-MMD and (2) TPDS-w-KL, where the mutual information objective is replaced by MMD and KL-divergence, respectively.
From the fifth and sixth rows of Table 10, both TPDS-w-MMD and TPDS-w-KL show a large gap to TPDS on all datasets, and even perform below the source model on Office-31 and Office-Home. This confirms the critical role of mutual information maximization in our pairwise alignment. The performance deterioration of the two comparison methods is explainable: MMD and KL-divergence are not pairwise objectives, as their computation is based on the entire set.
5.7.5 Effectiveness of Credible Sample Construction
To evaluate the credible sample construction, we compare our design (\({\mathcal {G}}_{e} \cap {\mathcal {G}}_{o}\)) with using either criterion alone (\({\mathcal {G}}_{e}\) or \({\mathcal {G}}_{o}\)), termed TPDS-w-\({\mathcal {G}}_e\) and TPDS-w-\({\mathcal {G}}_o\). The results in Table 10 (the last row group) show that the two selection criteria are complementary to the performance benefit, validating the efficacy of our design.
6 Conclusion
In this work, we have proposed a new Target Prediction Distribution Searching (TPDS) paradigm for SFDA. Unlike previous methods adopting the conventional feature distribution alignment strategy, TPDS searches for the target prediction distribution with a principled adaptation error mitigation mechanism. Concretely, we construct a flow of proxy prediction distributions and regularize adjacent ones to have only a slight shift, so that the flow smoothly converges to the target distribution along the geodesic path, on which the overall cumulative errors can be effectively alleviated. The experimental results on five benchmarks show that TPDS achieves state-of-the-art performance under the SFDA setting.
References
Abnar, S., Berg, R. v. d., Ghiasi, G., Dehghani, M., Kalchbrenner, N., & Sedghi, H. (2021). Gradual domain adaptation in the wild: When intermediate distributions are absent. Retrieved from arXiv preprint arXiv:2106.06080
Ahmed, W., Morerio, P., & Murino, V. (2022). Cleaning noisy labels by negative ensemble learning for source-free unsupervised domain adaptation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1616-1625).
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., & Raffel, C. A. (2019). Mixmatch: A holistic approach to semi-supervised learning. In Advances in neural information processing systems (pp. 5061-5072).
Boudiaf, M., Rony, J., Ziko, I. M., Granger, E., Pedersoli, M., Piantanida, P., & Ayed, I. B. (2020). A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In ECCV 2020 (pp. 548-564).
Caseiro, R., Henriques, J.-F., Martins, P., & Batista, J. (2015). Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In IEEE conference on computer vision and pattern recognition (pp. 3846-3854).
Chen, H.-Y., & Chao, W.-L. (2021). Gradual domain adaptation without indexed intermediate domains. In Advances in neural information processing systems (pp. 8201-8214).
Chen, W., Lin, L., Yang, S., Xie, D., Pu, S., Zhuang, Y., & Ren, W. (2021). Self-supervised noisy label learning for source-free unsupervised domain adaptation. Retrieved from arXiv preprint arXiv:2102.11614
Chen, X., Wang, S., Long, M., & Wang, J. (2019). Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International conference on machine learning (pp. 1081-1090).
Chidlovskii, B., Clinchant, S., & Csurka, G. (2016). Domain adaptation in the absence of source domain data. In International conference on knowledge discovery and data mining (pp. 451-460).
Cui, Z., Li, W., Xu, D., Shan, S., Chen, X., & Li, X. (2014). Flowing on Riemannian manifold: Domain adaptation by shifting covariance. IEEE Transactions on Cybernetics, 44(12), 2264–2273.
Deng, Z., Luo, Y., & Zhu, J. (2019). Cluster alignment with a teacher for unsupervised domain adaptation. In IEEE international conference on computer vision (pp. 9943-9952).
Du, Y., Yang, H., Chen, M., Jiang, J., Luo, H., & Wang, C. (2021). Generation, augmentation, and alignment: A pseudo-source domain based method for source-free domain adaptation. Retrieved from arXiv preprint arXiv:2109.04015
Du, Z., Li, J., Su, H., Zhu, L., & Lu, K. (2021). Cross-domain gradient discrepancy minimization for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 3937-3943).
Ghasedi Dizaji, K., Herandi, A., Deng, C., Cai, W., & Huang, H. (2017). Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In IEEE international conference on computer vision (pp. 5736-5745).
Gong, R., Li, W., Chen, Y., & Gool, L. V. (2019). Dlow: Domain flow for adaptation and generalization. In IEEE/CVF conference on computer vision and pattern recognition (pp. 2477-2486).
Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In IEEE international conference on computer vision (pp. 999-1006).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (pp. 770-778).
Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1994-2003).
Huang, J., Guan, D., Xiao, A., Lu, S., & Shao, L. (2022). Category contrast for unsupervised domain adaptation in visual tasks. In IEEE conference on computer vision and pattern recognition (pp. 1203-1214).
Jabi, M., Pedersoli, M., Mitiche, A., & Ayed, I. B. (2019). Deep clustering: On the link between discriminative models and k-means. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1887–1896.
Ji, X., Henriques, J. F., & Vedaldi, A. (2019). Invariant information clustering for unsupervised image classification and segmentation. In IEEE conference on computer vision and pattern recognition (pp. 9865-9874).
Jiang, X., Lao, Q., Matwin, S., & Havaei, M. (2020). Implicit class-conditioned domain alignment for unsupervised domain adaptation. In International conference on machine learning (pp. 4816-4827).
Jin, Y., Wang, X., Long, M., & Wang, J. (2020). Minimum class confusion for versatile domain adaptation. In European conference on computer vision (pp. 464-480).
Kim, Y., Cho, D., Han, K., Panda, P., & Hong, S. (2021). Domain adaptation without source data. IEEE Transactions on Artificial Intelligence, 2(6), 508–518.
Kumar, A., Ma, T., & Liang, P. (2020). Understanding self-training for gradual domain adaptation. In International conference on machine learning (pp. 5468-5479).
Lao, Q., Jiang, X., & Havaei, M. (2021). Hypothesis disparity regularized mutual information maximization. In The AAAI conference on artificial intelligence (pp. 8243-8251).
Lee, C.-Y., Batra, T., Baig, M. H., & Ulbricht, D. (2019). Sliced Wasserstein discrepancy for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 10285-10295).
Li, D., Yang, Y., Song, Y.-Z., & Hospedales, T. M. (2017). Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision (pp. 5542-5550).
Li, J., Du, Z., Zhu, L., Ding, Z., Lu, K., & Shen, H. T. (2022). Divergence-agnostic unsupervised domain adaptation by adversarial attacks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 8196–8211.
Li, R., Jiao, Q., Cao, W., Wong, H.-S., & Wu, S. (2020). Model adaptation: Unsupervised domain adaptation without source data. In IEEE conference on computer vision and pattern recognition (pp. 9638-9647).
Liang, J., Hu, D., & Feng, J. (2020). Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In International conference on machine learning (pp. 6028-6039).
Liang, J., Hu, D., Wang, Y., He, R., & Feng, J. (2021). Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3103390
Liu, Y., Zhang, W., & Wang, J. (2021). Source-free domain adaptation for semantic segmentation. In IEEE conference on computer vision and pattern recognition (pp. 1215-1224).
Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International conference on machine learning (pp. 97-105).
Long, M., Cao, Z., Wang, J., & Jordan, M. (2018). Conditional adversarial domain adaptation. In Advances in neural information processing systems (pp. 1647-1657).
Lu, Z., Yang, Y., Zhu, X., Liu, C., Song, Y.-Z., & Xiang, T. (2020). Stochastic classifiers for unsupervised domain adaptation. In IEEE/CVF conference on computer vision and pattern recognition (pp. 9111-9120).
Mueller, J. W., & Jaakkola, T. (2015). Principal differences analysis: Interpretable characterization of differences between distributions. Advances in Neural Information Processing Systems, 28
Muller, R., Kornblith, S., & Hinton, G. E. (2019). When does label smoothing help? In Advances in neural information processing systems (pp. 4696-4705).
Munro, J., & Damen, D. (2020). Multi-modal domain adaptation for fine-grained action recognition. In IEEE conference on computer vision and pattern recognition (pp. 119-129).
Pan, Y., Yao, T., Li, Y., Wang, Y., Ngo, C.-W., & Mei, T. (2019). Transferrable prototypical networks for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 2239-2247).
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253.
Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., & Saenko, K. (2017). VisDA: The visual domain adaptation challenge. Retrieved from arXiv preprint arXiv:1710.06924
Qiu, Z., Zhang, Y., Lin, H., Niu, S., Liu, Y., Du, Q., & Tan, M. (2021). Source-free domain adaptation via avatar prototype generation and adaptation. In International joint conference on artificial intelligence.
Roy, S., Krivosheev, E., Zhong, Z., Sebe, N., & Ricci, E. (2021). Curriculum graph co-teaching for multi-domain adaptation. In IEEE/CVF conference computer vision and pattern recognition (pp. 5351-5360).
Roy, S., Trapp, M., Pilzer, A., Kannala, J., Sebe, N., Ricci, E., & Solin, A. (2022). Uncertainty-guided source-free domain adaptation. In European conference on computer vision (pp. 537-555).
Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In European conference on computer vision (pp. 213-226).
Saito, K., Ushiku, Y., Harada, T., & Saenko, K. (2018). Adversarial dropout regularization. In International conference on learning representations: OpenReview.net.
Shen, J., Qu, Y., Zhang, W., & Yu, Y. (2018). Wasserstein distance guided representation learning for domain adaptation. In AAAI conference on artificial intelligence (Vol. 32).
Tang, H., Chen, K., & Jia, K. (2020). Unsupervised domain adaptation via structurally regularized deep clustering. In IEEE/CVF conference on computer vision and pattern recognition (pp. 8725-8735).
Tang, S., Ji, Y., Lyu, J., Mi, J., & Zhang, J. (2019). Visual domain adaptation exploiting confidence-samples. In IEEE international conference on intelligent robots and systems (pp. 1173-1179).
Tang, S., Shi, Y., Ma, Z., Li, J., Lyu, J., Li, Q., & Zhang, J. (2021). Model adaptation through hypothesis transfer with gradual knowledge distillation. In IEEE international conference on intelligent robots and systems (pp. 5679-5685).
Tang, S., Zou, Y., Song, Z., Lyu, J., Chen, L., Ye, M., & Zhang, J. (2022). Semantic consistency learning on manifold for source data-free unsupervised domain adaptation. Neural Networks, 152, 467–478.
Tanwisuth, K., Fan, X., Zheng, H., Zhang, S., Zhang, H., Chen, B., & Zhou, M. (2021). A prototype-oriented framework for unsupervised domain adaptation.
Tian, J., Zhang, J., Li, W., & Xu, D. (2022). Vdm-da: Virtual domain modeling for source data-free domain adaptation. IEEE Transactions on Circuits and Systems for Video Technology, 32(6), 3749–3760.
Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 2962-2971).
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. Retrieved from arXiv preprint arXiv:1412.3474
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2579–2605.
Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 5385-5394).
Wang, H., Li, B., & Zhao, H. (2022). Understanding gradual domain adaptation: Improved analysis, optimal path and beyond. Retrieved from arXiv preprint arXiv:2204.08200
Wang, X., Jin, Y., Long, M., Wang, J., & Jordan, M. (2019). Transferable normalization: Towards improving transferability of deep neural networks. In Advances in neural information processing systems (pp. 1951-1961).
Wu, Y., Inkpen, D., & El-Roby, A. (2020). Dual mixup regularized learning for adversarial domain adaptation. In European conference on computer vision (pp. 540-555).
Xia, H., Jing, T., & Ding, Z. (2022). Maximum structural generation discrepancy for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2022.3174526
Xia, H., Zhao, H., & Ding, Z. (2021). Adaptive adversarial network for source-free domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9010-9019).
Xu, R., Li, G., Yang, J., & Lin, L. (2019). Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In IEEE international conference on computer vision (pp. 1426-1435).
Yang, S., van de Weijer, J., Herranz, L., Jui, S., et al. (2021). Exploiting the intrinsic neighborhood structure for source-free domain adaptation. In Advances in neural information processing systems (pp. 532-542).
Yang, S., Wang, Y., van de Weijer, J., Herranz, L., & Jui, S. (2020). Unsupervised domain adaptation without source data by casting a bait. Retrieved from arXiv preprint. arXiv:2010.12427
Yue, Z., Sun, Q., Hua, X.-S., & Zhang, H. (2021). Transporting causal mechanisms for unsupervised domain adaptation. In IEEE/CVF international conference on computer vision (pp. 8599-8608).
Zhang, J., Huang, J., Tian, Z., & Lu, S. (2022). Spectral unsupervised domain adaptation for visual recognition. In IEEE conference on computer vision and pattern recognition (pp. 9829-9840).
Zhang, Y., Tang, H., Jia, K., & Tan, M. (2019). Domain-symmetric networks for adversarial domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 5031-5040).
Zhou, S., Wang, L., Zhang, S., Wang, Z., & Zhu, W. (2022). Active gradual domain adaptation: Dataset and approach. IEEE Transactions on Multimedia, 24, 1210–1220. https://doi.org/10.1109/TMM.2022.3142524
ZongxianLee. (2019). A pytorch implementation of maximum mean discrepancies (MMD) loss. https://github.com/ZongxianLee/MMD_Loss.Pytorch
Acknowledgements
This work is partly funded by the German Research Foundation (DFG) and National Natural Science Foundation of China in project Crossmodal Learning under contract Sonderforschungsbereich Transregio 169; the Hamburg Landesforschungsförderungsprojekt Cross; NSFC (61773083); the Horizon 2020 RISE project STEP2DYNA (691154); the National Key R&D Program of China (2020YFB1313600); and the National Natural Science Foundation of China (62206168, 62276048, U1813202).
Communicated by Yasuyuki Matsushita
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A Proof of Theorem 1
Gradual self-training is the main learning framework for the gradual domain adaptation problem (Zhou et al., 2022). Kumar et al. (2020) give theoretical results on its single-step transfer performance. Since this conclusion provides the basis for our theoretical analysis, we first present a brief introduction as follows.
Suppose any single-step transfer process of gradual self-training can be formulated as
\[ \theta _i = \mathrm{{SST}}\left( \theta _{i-1}, D_i \right) , \]
where \(D_i\) stands for the unlabeled samples from the distribution \(P_i\) to be learned, \(\theta _{i-1}\) is the trained model representing distribution \(P_{i-1}\), and the learning result of this step is \(\theta _{i}\), representing \(P_i\). We have the following theorem for the adaptation upper bound of the single-step transfer process.
Theorem 2
(Kumar et al., 2020) Given distributions P, Q with Wasserstein-infinity distance-based shift measure \(\mathrm{{D}}_w\left( P, Q \right) =\pi < 1/R\) (R stands for the regularization strength of the models to be learned) and identical marginals on Y, i.e., \(P(Y)=Q(Y)\) (no label shift). Suppose P, Q satisfy the bounded data assumption, we have an initial model \(\theta \) with objective loss \(L\left( \theta , P \right) \) and n unlabeled samples S from Q, and we set \(\theta '=\mathrm{{SST}}(\theta ,S)\) letting the self-training objective loss fall below \(\alpha ^*\) (\(\alpha ^*\) is a given small loss); then
\[ L\left( \theta ', Q \right) \le \frac{2}{1-\pi R}\, L\left( \theta , P \right) + \alpha ^{*} + O\!\left( \frac{1}{\sqrt{n}} \right) . \]
This theorem applies to our formulation, which can be considered a special case of gradual self-training, due to the following properties:
- (1) Sharing the same transfer strategy: leveraging a series of intermediate probability distributions that shift smoothly from the source domain to the target domain.
- (2) Sharing the same epoch-wise training method: the whole training is sliced into successive epochs, with each epoch realizing a single-step search of an intermediate distribution under the guidance of the previous intermediate distribution.
- (3) Each \(D_i\) in the original gradual self-training refers to a different set of samples. In our TPDS, we approximate \(D_i\) with a subset of the target samples, constructed from the hard samples selected by the model trained so far, which represents the previous adjacent intermediate distribution.
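The epoch-wise single-step transfer described above can be illustrated with a minimal, self-contained sketch. Everything here (the 1-D threshold classifier, the synthetic drifting samples, and the helper names `pseudo_label`, `fit_threshold`, `sst_step`) is hypothetical, chosen only to make the mechanics of one SST step concrete; it is not the TPDS implementation:

```python
def pseudo_label(theta, samples):
    # Label each sample with the current model: a 1-D threshold classifier.
    return [1 if x >= theta else 0 for x in samples]

def fit_threshold(theta_prev, samples, labels):
    # Refit the threshold to the pseudo-labels: midpoint of the two class means.
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        # Degenerate pseudo-labelling (all one class): no usable update.
        return theta_prev
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def sst_step(theta_prev, unlabeled):
    # One single-step transfer: pseudo-label with the previous model, then refit.
    return fit_threshold(theta_prev, unlabeled, pseudo_label(theta_prev, unlabeled))

def samples_at(k):
    # Two classes centred at -1 and +1, drifting right by 0.5 per step.
    return [c + 0.5 * k for c in (-1.0, -0.9, 0.9, 1.0)]

# Gradual adaptation: each small shift is tracked successfully.
theta = 0.0  # source model: the decision boundary at k = 0
for k in (1, 2, 3):
    theta = sst_step(theta, samples_at(k))
print(theta)  # follows the drift, ending near 1.5

# Direct adaptation across the full shift: pseudo-labels collapse to one
# class, so self-training cannot update the model at all.
print(sst_step(0.0, samples_at(3)))  # stuck at 0.0
```

Under a small per-step shift the pseudo-labels stay mostly correct and the model follows the drifting distributions, whereas a single large shift produces an unrecoverable error, matching the intuition behind the \(\pi < 1/R\) condition of Theorem 2.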
With Theorem 2, we further prove the following theorem about the transfer performance upper bound of our progressive searching.
Restatement of Theorem 1
Suppose the distributions in the proxy distribution flow \(\{ {P_{\theta _{k}}} \}_{k=0}^{K}\) satisfy no label shift (the C categories are fixed) and the bounded data assumption (the data is not too large: \(\Vert \varvec{x}_i\Vert _2^2 \le \rho \) for some \(\rho >0\) and all \( 1 \le i \le n\)). Let the distribution shifts in this flow be \(\Pi = \{ \pi _k \}_{k=1}^{K}\), where \(\pi _k\) changes gradually as k goes from 1 to K, and let \(\pi _m = \max (\Pi )\). If the source model \(\theta _s=\theta _0\) has a low loss \(\alpha _0 \ge \alpha ^*\) on the source domain, then
\[ L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \le {\left( \frac{2}{1-\pi _m R} \right) }^{K+1} \left( \alpha _0 + O\!\left( \frac{1}{\sqrt{n}} \right) \right) , \]
where \(L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \) is the objective loss when learning \({\theta _K}\) for \({P_{{\theta _K}}}\) prediction, R stands for the regularization strength of \({\theta _K}\), \(\alpha ^{*}\) is a given small loss, and n is the number of target samples. \(\square \)
Proof of Theorem 1
We start from the source model \(\theta _s\) with loss \(\alpha _0\). Applying Theorem 2 to each proxy distribution search, driven by the alignment of adjacent distributions in the flow, and noting that \(\pi _m = \max (\Pi ) \ge \pi _k\) for \(1 \le k \le K\), we have
\[ L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \le {\left( \frac{2}{1-\pi _m R} \right) }^{K} \alpha _0 + \sum \limits _{k=0}^{K-1} {\left( \frac{2}{1-\pi _m R} \right) }^{k} \left( \alpha ^{*} + O\!\left( \frac{1}{\sqrt{n}} \right) \right) . \]
Expanding the sum as a geometric series and using \(\alpha ^* \le \alpha _0\) together with \(\frac{2}{1-\pi _m R} \ge 2\), we obtain
\[ L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \le {\left( \frac{2}{1-\pi _m R} \right) }^{K+1} \left( \alpha _0 + O\!\left( \frac{1}{\sqrt{n}} \right) \right) . \]
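For completeness, the geometric-series step can be written out with the shorthand \(\beta = \frac{2}{1-\pi _m R}\) (an abbreviation introduced only here); since \(\pi _m R < 1\), we have \(\beta \ge 2\):
\[ \sum \limits _{k=0}^{K-1} \beta ^{k} = \frac{\beta ^{K}-1}{\beta -1} \le \beta ^{K} \quad (\beta \ge 2), \]
so that, using \(\alpha ^* \le \alpha _0\) and \(\beta \ge 2\),
\[ L\left( {{\theta _K},{P_{{\theta _K}}}} \right) \le \beta ^{K}\alpha _0 + \beta ^{K}\left( \alpha ^{*} + O\!\left( \tfrac{1}{\sqrt{n}} \right) \right) \le \beta ^{K}\left( 2\alpha _0 + O\!\left( \tfrac{1}{\sqrt{n}} \right) \right) \le \beta ^{K+1}\left( \alpha _0 + O\!\left( \tfrac{1}{\sqrt{n}} \right) \right) . \]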
\(\square \)
B Objective for Source Model Training
For all transfer tasks on the three datasets, we train the source model \(\theta _s\) on the source domain in a supervised manner using the following classic cross-entropy objective with label smoothing:
\[ L_{ce}\left( \theta _s \right) = - \frac{1}{n_s}\sum \limits _{i=1}^{n_s}\sum \limits _{c=1}^{C} \tilde{{l}}_{i,c}^s \log {p}_{i,c}^s , \]
where \(n_s\) is the number of source samples, \({p}_{i,c}^s\) is the c-th element of \(\varvec{p}_i^s=\theta _s({\varvec{x}}_{i}^{s})\), the category probability vector of input instance \({\varvec{x}}_{i}^{s}\) after \(\theta _s\) mapping; \(\tilde{{l}}_{i,c}^s\) is the c-th element of the smoothed label (Muller et al., 2019) \(\tilde{\varvec{l}}_i^s=(1-\sigma )\varvec{l}_i^s + \sigma /C\), in which \(\varvec{l}_i^s\) is the one-hot encoding of the hard label \(y_i^s\) and \(\sigma =0.1\).
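As a concrete illustration, the smoothed target and the per-sample loss above can be computed as follows. This is a minimal pure-Python sketch; the helper names `smooth_label` and `cross_entropy` are ours, not from the released code:

```python
import math

def smooth_label(y, C, sigma=0.1):
    # Smoothed target: (1 - sigma) * one_hot(y) + sigma / C.
    return [(1 - sigma) * (1.0 if c == y else 0.0) + sigma / C
            for c in range(C)]

def cross_entropy(p, l_tilde):
    # Cross-entropy between the smoothed target and predicted probabilities.
    return -sum(lc * math.log(pc) for lc, pc in zip(l_tilde, p))

# With C = 4 and sigma = 0.1, the true class gets 0.925 and every
# other class gets the uniform share sigma / C = 0.025.
target = smooth_label(y=2, C=4)
loss = cross_entropy([0.1, 0.1, 0.7, 0.1], target)
print(target, loss)
```

Spreading the mass \(\sigma /C\) over all classes keeps the source model from becoming over-confident, which makes its predictions a more reliable starting point for the subsequent adaptation.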
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Tang, S., Chang, A., Zhang, F. et al. Source-Free Domain Adaptation via Target Prediction Distribution Searching. Int J Comput Vis 132, 654–672 (2024). https://doi.org/10.1007/s11263-023-01892-w