Introduction

Deep neural networks (DNNs) have achieved remarkable success in a wide range of computer vision problems [1, 2]. However, this success relies heavily on large amounts of labeled training data, and such models still generalize poorly to emerging application domains because of the domain shift problem [3, 4]. To address these problems, many unsupervised domain adaptation (UDA) methods [5, 6] have been proposed to transfer knowledge from a labeled source domain to a different, unlabeled target domain. They can be divided into two main categories: domain alignment methods [7,8,9] and adversarial learning methods [10, 11]. However, these methods mainly adopt a CNN backbone (e.g., ResNet [12]) to learn class-level alignment, which does not deliver satisfactory performance on large-scale datasets (e.g., VisDA-2017 [13]). For example, ResNet-101 achieves an average accuracy of only 52.4% [14] on VisDA-2017.

Recently, instead of CNN-based architectures, the Vision Transformer (ViT) [15] has been adopted as a more powerful backbone to improve transferability on UDA tasks. For example, Yang et al. [16] proposed a transferable vision transformer method to investigate the transferability of ViT. Sun et al. [17] proposed SSRT, a transformer-based method with self-refinement, to refine the domain adaptation model. Xu et al. [18] explored a cross-domain transformer for unsupervised domain adaptation. However, these ViT-based methods do not take full advantage of different complementary sources of supervision to improve UDA performance, e.g., global class-level, local pixel-level, and local patch-level representations.

Motivated by the above discussion, we focus on solving UDA tasks from three aspects:

First, from the global class-level representation aspect, masked image modeling [19], which trains the model to predict missing information from masked image patches, has recently shown very strong representation learning performance. Following ViT [15], we divide each image into regular non-overlapping patches. As shown in Fig. 1, given an image, we randomly mask some proportion of its patches and replace them with a special mask embedding [MSK], which learns local pixel-level representation at masked positions from the information in contextual patches. In addition, a [CLS] token is a special symbol added to extract global information, which is helpful for classifier prediction. Specifically, we extend the masked self-distillation framework to integrate the vision transformer into UDA, so the [CLS] embedding is used to learn discriminative, global, task-specific class predictions for UDA.
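As an illustration of this masking step, the following sketch shows one plausible way to replace a random subset of patch embeddings with a learnable [MSK] token before they enter the transformer encoder. The tensor shapes, the `mask_ratio` argument, and the helper name are our assumptions for exposition, not the authors' released code.

```python
import torch

def random_mask_patches(patch_tokens: torch.Tensor,
                        msk_token: torch.Tensor,
                        mask_ratio: float = 0.4):
    """Replace a random subset of patch embeddings with a learnable [MSK] token.

    patch_tokens: (B, N, D) patch embeddings (the [CLS] token is excluded).
    msk_token:    (1, 1, D) learnable mask embedding.
    Returns the masked token sequence and a boolean mask of shape (B, N).
    """
    B, N, D = patch_tokens.shape
    num_mask = int(N * mask_ratio)
    # Randomly pick `num_mask` positions per image.
    noise = torch.rand(B, N, device=patch_tokens.device)
    ids = noise.argsort(dim=1)[:, :num_mask]                  # (B, num_mask)
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, ids, True)
    # Substitute the [MSK] embedding at the masked positions.
    masked = torch.where(mask.unsqueeze(-1), msk_token.expand(B, N, D), patch_tokens)
    return masked, mask
```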

Second, from the local pixel-level representation aspect, most existing UDA methods ignore another source of supervision available in the raw target-domain images. We follow SimMIM [19] and simply predict the RGB pixel values of the masked patches, which learns local pixel-level representation at masked positions from the information in contextual patches.

Third, from the local patch-level representation aspect, we aim to bridge the gap between the two domains (i.e., cross-domain adaptation alignment) so that the local [MSK] features can complement the global [CLS] feature for better classification performance. We therefore introduce an adversarial weighted cross-domain adaptation objective that captures the discriminative potential of patch tokens and learns transferable and discriminative domain-specific patch-level representation.

Based on these three aspects, in this paper we propose a novel UDA solution named TAMS (Transferable Adversarial Masked Self-distillation for UDA). As shown in Fig. 1, TAMS takes a vision transformer as the backbone network and utilizes a self-distillation framework for unsupervised domain adaptation. Different from existing ViT-based methods [16,17,18], TAMS jointly optimizes three key objectives that take both global and local representations into account, which effectively exploits the transferability of ViT to improve UDA performance.

In summary, the major contributions of this work are:

  1. We present a novel masked self-distillation framework for UDA, called TAMS. Specifically, TAMS distills the representation of a full image into the representation predicted from a masked image, which effectively introduces masked self-distillation of ViT into knowledge transfer for the UDA task.

  2. TAMS jointly optimizes three objectives to transfer knowledge for the UDA task: (1) an adversarial masked self-distillation objective to learn task-specific global class-level representation; (2) a masked image modeling objective to learn local pixel-level representation; (3) an adversarial weighted cross-domain adaptation objective to learn domain-specific patch-level representation.

  3. Extensive experiments are conducted on widely used benchmarks. TAMS achieves competitive performance compared to state-of-the-art methods, including 94.18% on Office-31, 85.63% on Office-Home and 88.38% on VisDA-2017.

Fig. 1

The overall framework of TAMS. We introduce a masked self-distillation teacher–student network to learn the latent representation for the target domain, where the student branch consists of an encoder–decoder architecture that takes masked images as input, while the teacher branch contains an encoder that produces latent representations and updates its weights from the student network using EMA. We randomly replace image patches with [MSK] tokens on target images, where a mask token [MSK] is used to learn local pixel-level representation at masked positions from the information in contextual patches, while a [CLS] token is a special symbol added to extract global information. In our method, we jointly optimize three objectives to transfer knowledge for the UDA task: (1) an adversarial masked self-distillation objective (“Adversarial masked self-distillation”) to learn task-specific global class-level representation; (2) a masked image modeling objective (“Masked image modeling”) to learn local pixel-level representation; (3) an adversarial weighted cross-domain adaptation objective (“Adversarial weighted cross-domain adaptation”) to learn domain-specific patch-level representation. Note that the labeled source-domain data are also used to train the student branch; this is omitted from the figure for clarity

Related work

Unsupervised domain adaptation Traditional machine learning assumes that training and test data are drawn from the same distribution. However, in practical application scenarios, the target data usually follow a different distribution from the training source data. Thus, transfer learning [20] has been used to transfer generalized knowledge across domains with different distributions, i.e., unsupervised domain adaptation (UDA), where no labels are available for the target domain. Existing UDA methods can be roughly divided into two categories: domain-level methods [6, 8, 21,22,23] and class-level methods [24,25,26,27]. Domain-level methods aim to align the distributions of the source and target domains using measures such as Maximum Mean Discrepancy (MMD) [7, 8, 28] and Correlation Alignment (CORAL) [29, 30]. Another line of work performs fine-grained class-level label distribution alignment via adversarial learning [31], which focuses on learning domain-invariant representations with a feature extractor and a domain discriminator [32, 33]. Unlike coarse-grained domain-level alignment, class-level methods align each category's distribution between the source and target domain data. Different from both categories, our method adopts ViT to simultaneously take class-level, patch-level and pixel-level representations into account, exploiting fine-grained alignment on the Transformer through an adversarial cross-domain adaptation objective.

Vision transformer for UDA Many Vision Transformer (ViT) [15] variants have been applied successfully to various vision tasks such as image classification, object detection and segmentation; they model long-range dependencies among visual features by the self-attention mechanism. To improve UDA performance, many ViT-based methods have been proposed. For example, Sun et al. [17] proposed SSRT, a transformer-based method with self-refinement, to refine the domain adaptation model. Yang et al. [16] proposed a transferable vision transformer method to investigate the transferability of ViT. Xu et al. [18] explored a cross-domain transformer for unsupervised domain adaptation. However, these methods do not take full advantage of complementary sources of supervision such as masked image modeling [19], which can be introduced into UDA to improve performance. In contrast, we present a novel masked self-distillation framework that jointly optimizes three objectives to transfer knowledge for the UDA task.

The proposed method

We first give the problem formulation in “Problem formulation” and then introduce the proposed method TAMS, which jointly optimizes three objectives to transfer knowledge for the UDA task: (1) an adversarial masked self-distillation objective to learn task-specific global class-level representation; (2) a masked image modeling objective to learn local pixel-level representation; (3) an adversarial weighted cross-domain adaptation objective to learn domain-specific patch-level representation. The overall framework is shown in Fig. 1.

Problem formulation

In UDA, there is a source domain with labeled data \({{\mathcal {D}}_{\text {s}}} = \{ (x_i^s,y_i^s)\} _{i = 1}^{{n_s}}\) from \({\mathcal {X}} \times {\mathcal {Y}}\) and a target domain with unlabeled data \({{\mathcal {D}}_t} = \{ (x_i^t)\} _{i = 1}^{{n_t}}\) from \({\mathcal {X}}\), where \({\mathcal {X}}\) is the input space and \({\mathcal {Y}}\) is the label space. We employ a ViT encoder \({G_f}\) for feature learning, and a classifier \({G_c}\) for classification. We can derive the cross-entropy loss for the source domain:

$$\begin{aligned} {{\mathcal {L}}_{\textrm{cls}}}({x^s},{y^s}) = \frac{1}{{{n_s}}}\sum \limits _{{x_i} \in {{\mathcal {D}}_s}} {{{\mathcal {L}}_{\textrm{ce}}}({G_c}({G_f}(x_i^s)),y_i^s)}. \end{aligned}$$
(1)
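As a minimal sketch of Eq. (1), assuming a ViT encoder `G_f` that returns the [CLS] feature and a linear classifier head `G_c` (both names are placeholders):

```python
import torch.nn.functional as F

def source_cls_loss(G_f, G_c, x_s, y_s):
    """Eq. (1): standard cross-entropy on labeled source images."""
    feat_cls = G_f(x_s)        # assumed to return the [CLS] feature, shape (B, D)
    logits = G_c(feat_cls)     # shape (B, num_classes)
    return F.cross_entropy(logits, y_s)
```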

Adversarial masked self-distillation

Following the typical adversarial adaptation method [31], let \({D_g}\) be a domain discriminator for global feature alignment, which is applied to the output of the [CLS] token for both the source and target domains. For adversarial domain adaptation [32, 33], \({G_f}\) and \({D_g}\) play a minimax game: \({G_f}\) learns domain-invariant features to fool the discriminator, while \({D_g}\) tries to discriminate source-domain features from target-domain ones. The adversarial objective can be formulated as:

$$\begin{aligned} {{\mathcal {L}}_{\textrm{adv}}}({x^s},{x^t}) = - \frac{1}{{{n_s} + {n_t}}}\sum \limits _{{x_i} \in {{\mathcal {D}}_s} \cup {{\mathcal {D}}_t}} {{{\mathcal {L}}_{\textrm{ce}}}({D_g}({G_f}(x_i^*)),y_i^d)},\nonumber \\ \end{aligned}$$
(2)

where \({{\mathcal {L}}_{\textrm{ce}}}\) is the cross-entropy loss and the superscript \(*\) can be either the source domain s or the target domain t. Here, \(y_i^d = 1\) denotes the source-domain label and \(y_i^d = 0\) denotes the target-domain label.
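The minimax game in Eq. (2) is commonly implemented with a gradient reversal layer [10]; the sketch below follows that convention and is an illustration under our own assumptions about module interfaces, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, coeff: float):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.coeff * grad_output, None

def global_adv_loss(G_f, D_g, x_s, x_t, coeff: float = 1.0):
    """Eq. (2): domain discriminator D_g applied to the [CLS] features of both domains."""
    f_s, f_t = G_f(x_s), G_f(x_t)                       # [CLS] features
    feats = GradReverse.apply(torch.cat([f_s, f_t], dim=0), coeff)
    logits = D_g(feats)                                 # (B_s + B_t, 2)
    labels = torch.cat([torch.ones(f_s.size(0)),        # y^d = 1 for source
                        torch.zeros(f_t.size(0))]).long().to(feats.device)
    return F.cross_entropy(logits, labels)
```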

To leverage more complementary information, we introduce self-distillation [34, 35] to fully utilize the model's own previous knowledge to drive its training. This idea has also been used to solve UDA tasks, e.g., pseudo-label learning [36] and self-ensemble learning [37]. Here, we introduce a masked self-distillation teacher–student network to produce the latent representation for the target domain, where the student branch consists of an encoder–decoder architecture that takes masked images as input, while the teacher branch contains an encoder that produces latent representations and updates its weights \(\hat{\theta }\) from the student network parameters \(\theta \) using an Exponential Moving Average (EMA) [38], i.e., \(\hat{\theta }= \mu \hat{\theta }+ (1 - \mu )\theta \). Following FixMatch [39], we set \(\mu \) to 0.99, and we introduce a self-distillation loss that uses only target-domain pseudo-labels whose maximum scores exceed a threshold \(\lambda \) [37]:

$$\begin{aligned} {{\mathcal {L}}_{{\text {self-KD}}}}({x^t};\theta ) = \frac{1}{B}\sum _{b = 1}^B {\mathbb {I}} (\max {q_b} \geqslant \lambda ){{\mathcal {L}}_{\textrm{ce}}}({p_b},{\hat{q}_b}), \end{aligned}$$
(3)

where \({p_b}\) is the prediction of the student branch, \({q_b}\) is the prediction of the teacher branch, \({\hat{q}_b} = \arg \max {q_b}\), and B is the batch size on the unlabeled target domain. In our experiments, we set \(\lambda \) to 0.5. We can then derive the adversarial masked self-distillation objective as

$$\begin{aligned} {{\mathcal {L}}_{{\text {mask-KD}}}}({x^s},{x^t}) = {{\mathcal {L}}_{\textrm{adv}}}({x^s},{x^t}) + {{\mathcal {L}}_{{\text {self-KD}}}}({x^t};\theta ), \end{aligned}$$
(4)

where \({{\mathcal {L}}_{{\text {mask-KD}}}}\) not only exploits the model's own complementary information, but also learns task-specific global class-level representation.
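A sketch of the EMA teacher update and the thresholded self-distillation loss of Eq. (3), assuming the teacher and student share the same architecture; `mu` corresponds to the EMA momentum 0.99 and `lam` to the threshold \(\lambda = 0.5\).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, mu: float = 0.99):
    """EMA update: theta_hat <- mu * theta_hat + (1 - mu) * theta."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(mu).add_(p_s, alpha=1.0 - mu)

def self_kd_loss(student_logits, teacher_logits, lam: float = 0.5):
    """Eq. (3): cross-entropy against confident teacher pseudo-labels only."""
    q = teacher_logits.softmax(dim=-1).detach()     # teacher prediction q_b
    conf, pseudo = q.max(dim=-1)                    # max score and pseudo-label argmax q_b
    keep = (conf >= lam).float()                    # indicator I(max q_b >= lambda)
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (keep * loss).mean()                     # average over the batch
```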

Masked image modeling

Although Eq. (4) can learn task-specific global class-level representation, Eq. (3) may introduce noisy pseudo-labels into training. Thus, we introduce another source of complementary information from the raw image to learn local pixel-level representation. We follow SimMIM [19] and predict the RGB pixel values of the masked patches. Specifically, given the output embedding \(z_b^m\) of the m-th [MSK] token, we first feed it into a linear decoding head to generate the predicted RGB values \(y_b^m \in {{\mathcal {R}}^K}\) for the patch, where K denotes the number of RGB pixels per patch. The masked image modeling objective can then be formulated as

$$\begin{aligned} {{\mathcal {L}}_{{\text {MIM}}}}({x^t};\theta ) = \frac{1}{{BMK}}{\sum _{b = 1}^B \sum _{m = 1}^M {\left\| {y_b^m - x_b^m} \right\| }_1}, \end{aligned}$$
(5)

where M denotes the number of masked patches per image, and \(x_b^m\) denotes the ground-truth RGB values. In our experiments, we also explore the influence of different masking ratios \(\varepsilon \) in “Experiments”.
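A minimal sketch of the SimMIM-style reconstruction loss in Eq. (5), computed only at masked patch positions; the patch-flattening step and the tensor shapes are our assumptions.

```python
import torch

def mim_loss(pred_rgb: torch.Tensor, target_img: torch.Tensor,
             mask: torch.Tensor, patch_size: int = 16):
    """Eq. (5): l1 loss between predicted and ground-truth RGB values on masked patches.

    pred_rgb:   (B, N, K) predictions from the linear decoding head, K = 3 * patch_size**2.
    target_img: (B, 3, H, W) original target-domain image.
    mask:       (B, N) boolean mask, True at masked patch positions.
    """
    B, C, H, W = target_img.shape
    # Flatten the image into per-patch RGB targets of shape (B, N, K).
    target = target_img.unfold(2, patch_size, patch_size) \
                       .unfold(3, patch_size, patch_size)          # (B, C, H/P, W/P, P, P)
    target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size ** 2)
    l1 = (pred_rgb - target).abs().mean(dim=-1)                    # per-patch mean l1, (B, N)
    # Average only over the M masked patches of each image in the batch.
    return (l1 * mask).sum() / mask.sum().clamp(min=1)
```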

Fig. 2

The weighted cross-attention for the Transformer. By using the entropy \(H({D_p}({K^{\textrm{patch}}}))\) to assign weights to different patches, the cross-attention mechanism (Eq. (6)) not only captures semantic importance \(({\text {softmax}} (\frac{{Q{K^T}}}{{\sqrt{d} }}))\) in the attention map, but also captures the transferability of each patch token

Adversarial weighted cross-domain adaptation

Let \((H, W)\) denote the resolution of the original image, C the number of channels, and \((P, P)\) the resolution of each image patch. The number of patches is \(N = HW/{P^2}\). For the self-attention of the Transformer, the patches are projected into three vectors, i.e., queries \(\mathbf{{Q}} \in {{\mathcal {R}}^{N \times d}}\), keys \(\mathbf{{K}} \in {{\mathcal {R}}^{N \times d}}\) and values \(\mathbf{{V}} \in {{\mathcal {R}}^{N \times d}}\), which emphasize relationships among patches by computing inner products. Different from traditional self-attention, and inspired by [16, 18], as shown in Fig. 2, we leverage weighted cross-attention to learn mixed-up feature representations for both the source and target domains, which can be formulated as:

$$\begin{aligned}{} & {} \!\!\!\text {Attn}_{{\textrm{cross}}}({x^s},{x^t}) \nonumber \\{} & {} \quad = \underbrace{{\textrm{softmax}} \left( \frac{{{Q_s}K_t^T}}{{\sqrt{d} }}\right) \times \left[ \begin{array}{l} 1\\ H({D_p}(K_t^{\textrm{patch}})) \end{array} \right] {V_t}}_{s \rightarrow t}\nonumber \\{} & {} \qquad +\underbrace{{\textrm{softmax}} \left( \frac{{{Q_t}K_s^T}}{{\sqrt{d} }}\right) \times \left[ \begin{array}{l} 1\\ H({D_p}(K_s^{\textrm{patch}})) \end{array} \right] {V_s}}_{t \rightarrow s}, \end{aligned}$$
(6)

where \({D_p}\) is a patch-level domain discriminator introduced to match cross-domain local features. Here, instead of using the masked target vectors (queries \(\mathbf{{Q}} \in {{\mathcal {R}}^{N \times d}}\), keys \(\mathbf{{K}} \in {{\mathcal {R}}^{N \times d}}\) and values \(\mathbf{{V}} \in {{\mathcal {R}}^{N \times d}}\)), we use the complete target vectors to avoid losing pixel information when computing the cross-attention. By using the entropy \(H({D_p}({K^{\textrm{patch}}}))\) to assign weights to different patches, the cross-attention in Eq. (6) considers not only semantic importance \(({\text {softmax}} (\frac{{Q{K^T}}}{{\sqrt{d} }}))\) but also the transferability of each patch token. Analogous to Eq. (2), we introduce the adversarial cross-domain adaptation objective as

$$\begin{aligned}{} & {} \!\!\!{{\mathcal {L}}_{\textrm{patch}}}({x^s},{x^t})\nonumber \\{} & {} \quad = - \frac{1}{{({n_s} + {n_t})P}}\sum \limits _{{x_i} \in {{\mathcal {D}}_s} \cup {{\mathcal {D}}_t}} {\sum _{j = 1}^P {{{\mathcal {L}}_{\textrm{ce}}}({D_p}({G_f}(x_{ij}^*)),y_{ij}^d)} },\nonumber \\ \end{aligned}$$
(7)

where P is the number of patches. Following adversarial learning, \({D_p}\) tries to assign 1 to a source-domain patch and 0 to a target-domain patch, while \({G_f}\) tries to confuse it. As a result, adversarial cross-domain adaptation aggregates the two input images based on the different weights of the patches, which aligns the patch-level representations across domains.
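To make the mechanism concrete, the sketch below illustrates one direction (s → t) of the entropy-weighted cross-attention in Eq. (6) together with the patch-level adversarial loss of Eq. (7). It omits the leading "1" entry of the weight vector (which the figure associates with the [CLS] row) and assumes a patch-level discriminator `D_p` that outputs a domain probability per patch; these simplifications and the tensor shapes are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def binary_entropy(p: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """H(p) for the Bernoulli outputs of the patch-level discriminator D_p."""
    p = p.clamp(eps, 1 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log())

def weighted_cross_attention(Q_s, K_t, V_t, D_p):
    """One direction (s -> t) of Eq. (6): attention re-weighted by patch transferability.

    Q_s, K_t, V_t: (B, N, d) query/key/value projections of source and target patches.
    D_p:           patch-level domain discriminator mapping (B, N, d) -> (B, N) probabilities.
    """
    d = Q_s.size(-1)
    attn = torch.softmax(Q_s @ K_t.transpose(1, 2) / d ** 0.5, dim=-1)   # semantic importance
    w = binary_entropy(D_p(K_t))                                         # transferability weights
    return attn @ (w.unsqueeze(-1) * V_t)                                # (B, N, d)

def patch_adv_loss(D_p, patch_feats_s, patch_feats_t):
    """Eq. (7): binary cross-entropy over all patch tokens of both domains."""
    p_s, p_t = D_p(patch_feats_s), D_p(patch_feats_t)             # (B, N) domain probabilities
    loss_s = F.binary_cross_entropy(p_s, torch.ones_like(p_s))    # 1 for source patches
    loss_t = F.binary_cross_entropy(p_t, torch.zeros_like(p_t))   # 0 for target patches
    return 0.5 * (loss_s + loss_t)
```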

Following the literature [16, 40, 41], we also maximize the mutual information of the target label distribution to avoid all target samples being assigned to the same class:

$$\begin{aligned} I({p^t};{x^t}) = H({\bar{p}^t}) - \frac{1}{{{n_t}}}\sum _{i = 1}^{{n_t}} {H(p_i^t)}, \end{aligned}$$
(8)

where \(p_i^t = {\text {softmax}} ({G_c}({G_f}(x_i^t)))\) and \({\bar{p}^t} = {{\mathbb {E}}_{{x_t}}}[{p^t}]\). By maximizing mutual information, our model further improves generalization to the target domain.
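A short sketch of the mutual information estimate in Eq. (8), computed from the softmax predictions of a target batch; the function name and the small epsilon for numerical stability are illustrative.

```python
import torch

def mutual_info(target_logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (8): I(p^t; x^t) = H(mean prediction) - mean(per-sample entropy)."""
    p = target_logits.softmax(dim=-1)                       # (B, C) target predictions p_i^t
    p_mean = p.mean(dim=0)                                  # marginal prediction p_bar^t
    h_marginal = -(p_mean * (p_mean + eps).log()).sum()     # H(p_bar^t)
    h_cond = -(p * (p + eps).log()).sum(dim=-1).mean()      # average of H(p_i^t)
    return h_marginal - h_cond                              # maximized during training
```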

To summarize, the objective function of TAMS can be defined as

$$\begin{aligned} {{\mathcal {L}}_{\textrm{TAMS}}}= & {} {{\mathcal {L}}_{\textrm{cls}}}({x^s},{y^s}) + \alpha {{\mathcal {L}}_{{\text {mask-KD}}}}({x^s},{x^t}) + \beta {{\mathcal {L}}_{{\text {MIM}}}}({x^t};\theta )\nonumber \\{} & {} \quad +\gamma {{\mathcal {L}}_{\textrm{patch}}}({x^s},{x^t}) - \delta I({p^t};{x^t}), \end{aligned}$$
(9)

where \(\alpha \), \(\beta \), \(\gamma \) and \(\delta \) are hyper-parameters. In our experiments, \(\alpha \), \(\beta \) and \(\delta \) are set to 0.1, while \(\gamma \) is set to 0.01.
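For clarity, a sketch of how the overall objective in Eq. (9) could be assembled from the individual terms, using the default weights reported above; the dictionary of precomputed loss values is purely an illustrative interface.

```python
def tams_total_loss(losses: dict,
                    alpha: float = 0.1, beta: float = 0.1,
                    gamma: float = 0.01, delta: float = 0.1):
    """Eq. (9): weighted combination of the TAMS objectives."""
    return (losses["cls"]                      # source cross-entropy, Eq. (1)
            + alpha * losses["mask_kd"]        # Eq. (4) = Eq. (2) + Eq. (3)
            + beta * losses["mim"]             # Eq. (5)
            + gamma * losses["patch"]          # Eq. (7)
            - delta * losses["mutual_info"])   # Eq. (8), subtracted so it is maximized
```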

Experiments

In this section, we evaluate and analyze the proposed TAMS method on benchmark datasets.

Experiment setup

Datasets To verify the effectiveness of TAMS, we conduct comprehensive experiments on benchmark datasets, including Office-31 [42], Office-Home [43] and VisDA-2017 [13]. Office-31 [42] contains 4,652 images of 31 classes from three domains: Amazon (A), DSLR (D), and Webcam (W). As shown in Table 1, ‘A \(\rightarrow \) W’ denotes the task with A as the source domain and W as the target domain. Office-Home [43] consists of 15,500 images of 65 classes from four domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-world (Rw) images. VisDA-2017 [13] is a synthetic-to-real dataset, which contains about 0.2 million images in 12 classes.

Baseline methods We compare against state-of-the-art UDA methods, including RevGrad [10], DDC [7], JAN [44], MinEnt [45], DAN [8], DANN [10], CDAN [46], MCD [24], SWD [47], BNM [48], DCAN [49], SHOT [14], ATDOC-NA [50], ALDA [51], TVT [16], CDTrans [18] and SSRT [17]. For a fair comparison, we use the results reported in the original papers. We also compare our method with different backbones such as ResNet-50 and ViT.

Implementation details In our experiments, we use ViT-small (ViT-S) and ViT-base (ViT-B) with \(16 \times 16\) patch size [1], pre-trained on ImageNet, as the vision transformer backbones, which contain 12 transformer layers in the encoder. ViT-S (22 M parameters) is comparable in size to ResNet-50. The ‘Baseline’ refers to training ViT-S/ViT-B with the adversarial domain adaptation method (i.e., Eq. (2)). We train the ViT-S/ViT-B models using a mini-batch SGD optimizer with a momentum of 0.9, initializing the learning rate at 0 and linearly increasing it to \(lr = 0.03\) over the first 500 training steps, except that \(lr = 0.003\) is used for \({\text {D}} \rightarrow {\text {A}} \) and \({\text {W}} \rightarrow {\text {A}} \) on the Office-31 dataset. In addition, we set the masking ratio \(\varepsilon \) to 0.4 for masked image modeling.
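The warm-up schedule described above can be sketched as a simple linear ramp; any decay after the warm-up phase is not specified in the text and is therefore omitted here.

```python
def warmup_lr(step: int, peak_lr: float = 0.03, warmup_steps: int = 500) -> float:
    """Linearly increase the learning rate from 0 to peak_lr over the first warmup_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# Sketch of usage with a PyTorch SGD optimizer (momentum 0.9, as in the paper):
# for step, batch in enumerate(loader):
#     for group in optimizer.param_groups:
#         group["lr"] = warmup_lr(step)
#     ...
```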

Experimental results

Table 1 Accuracy (%) on the Office-31 dataset
Table 2 Accuracy (%) on the Office-Home dataset
Table 3 Accuracy (%) on the VisDA-2017 dataset

The detailed experimental results are shown in Tables 1, 2 and 3. We can observe that the transformer-based UDA methods perform better than the ResNet-based UDA models, which shows that the transformer is a more powerful backbone than ResNet for UDA tasks. Detailed analyses for each dataset are given below.

Office-31 results As shown in Table 1, the proposed TAMS outperforms the ResNet-based UDA models and obtains better performance than the other ViT-based UDA methods. For example, on some transfer tasks (e.g., \({\text {D}} \rightarrow {\text {A}} \) and \({\text {W}} \rightarrow {\text {A}} \)), the proposed TAMS performs better than CDTrans [18], TVT [16] and SSRT-B [17].

Office-Home results As shown in Table 2, our method has the highest average accuracy of 85.63%. Compared with the ResNet-based models, the ViT-based models show a larger improvement. In particular, the proposed TAMS significantly improves the performance on all transfer tasks. Compared with SSRT-B [17], our method performs better on several transfer tasks (e.g., Cl \(\rightarrow \) Pr and Rw \(\rightarrow \) Ar). While many methods (e.g., CDTrans [18] and TVT [16]) have lower accuracy when transferring to the Cl domain, our method achieves better performance, which implicitly indicates that our method generalizes better across different transfer tasks.

VisDA-2017 results As shown in Table 3, our method achieves a better average accuracy of 88.4%. Compared with the other methods, our TAMS achieves the best performance on three classes including bicycle, horse, and skateboard.

Ablation study

We conduct ablation studies to verify the effects of the different loss functions, including \({{\mathcal {L}}_{\textrm{cls}}}({x^s},{y^s})\), \({{\mathcal {L}}_{\textrm{adv}}}({x^s},{x^t})\), \({{\mathcal {L}}_{{\text {self-KD}}}}({x^t};\theta )\), \({{\mathcal {L}}_{{\text {MIM}}}}({x^t};\theta )\), \({{\mathcal {L}}_{\textrm{patch}}}({x^s},{x^t})\) and \(I({p^t};{x^t})\). We also explore the effect of different ViT encoders, comparing traditional self-attention with the cross-attention of Eq. (6). Comparing the second, third and fourth rows in Table 4, we can see that the self-distillation loss and masked image modeling are effective for ViT backbones in UDA tasks. Comparing the fourth and fifth rows, the patch-level adversarial loss further improves the performance on the Office-31 dataset. In addition, the proposed TAMS with the cross-attention of Eq. (6) achieves better performance than traditional self-attention in ViT, which demonstrates the effectiveness of the weighted cross-attention mechanism used in TAMS.

Diversity analysis with masked image modeling

We examine the attention diversity of the heads and observe whether masked image modeling (MIM) helps to improve the transferable performance of UDA. Following [15], we use the average attention distance to measure whether an attention head attends locally or globally, which partially reflects the receptive field size of each attention head. Figure 3 shows the average attention distance per head and layer depth using the ViT-B/16 architecture. We visualize the average attention distance with and without MIM. It can be seen that: (1) Without MIM (the top row), the average attention distances of different heads in deeper layers collapse into a very small distance range. This suggests that different heads learn very similar visual cues and may be wasting model capacity. (2) With MIM (the bottom row), the attention representations become more diverse in terms of the average attention distance, especially in deeper layers. As analyzed in [52], masked image modeling brings a locality inductive bias to the trained model and more diversity across attention heads, which helps enhance the transferable performance of UDA.
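For reference, the average attention distance per head can be computed from a layer's attention map roughly as follows, assuming a square grid of patch tokens with the [CLS] token excluded; this follows the common recipe of [15] rather than code released by the authors.

```python
import torch

def avg_attention_distance(attn: torch.Tensor, grid_size: int, patch_size: int = 16):
    """Average attention distance (in pixels) per head.

    attn: (num_heads, N, N) attention weights over N = grid_size**2 patch tokens,
          each row summing to 1.
    """
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float() * patch_size           # (N, 2) patch positions
    dist = torch.cdist(coords, coords)                            # pairwise distances, (N, N)
    # Expected distance under each query's attention distribution, averaged over queries.
    return (attn * dist.unsqueeze(0)).sum(dim=-1).mean(dim=-1)    # (num_heads,)
```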

Effect of different masking ratio

As shown in Fig. 4, we analyze the influence of different masking ratios \(\varepsilon \) on Office-31. When \(\varepsilon =0.4\), the model obtains better test accuracy, which implicitly shows that making full use of the pixel-level supervision in the target domain is beneficial for enhancing the transferable performance of UDA.

Table 4 Ablation study of each module

Effect of different masking strategies

For UDA tasks, we adopt a simple random masking strategy. Furthermore, we also study how different masking strategies affect the effectiveness of UDA. In our experiments, we try other masking strategies (i.e., square [53], block-wise [54], and random) with different masked patch sizes (i.e., 16 and 32). The detailed experimental results are shown in Table 5. We first notice that the best test accuracy of our simple random masking strategy reaches 94.18%, which is 0.12% higher than the best block-wise masking strategy. In addition, when a large masked patch size of 32 is adopted, the different masking strategies all perform stably within a small accuracy range.

Table 5 Experiments on different masking strategies (i.e., square [53], block-wise [54], and random) with different masked patch sizes (i.e., 16 and 32), mask ratio is set to 0.4
Fig. 3

The average attention distance in different attention heads (dots). Top row: TAMS does not use masked image modeling (MIM); Bottom row: TAMS uses masked image modeling (MIM)

Effect of adversarial cross-domain adaptation

To further verify the effectiveness of adversarial cross-domain adaptation (ACA), we use t-SNE to visualize the feature representations with and without ACA, as shown in Fig. 5. Blue and red points represent source-domain and target-domain samples, respectively. It can be seen that without ACA, the features are not well aligned, e.g., on A \(\rightarrow \) W and W \(\rightarrow \) A. With ACA, TAMS achieves good feature alignment; as shown for W \(\rightarrow \) A, our method not only produces compact intra-class and separable inter-class representations but also aligns the feature distributions more effectively than without ACA. This shows that adversarial cross-domain adaptation can capture discriminative information and achieve better alignment.

Fig. 4

The influence of different masking ratios on Office-31. When \(\varepsilon =0.4\), the model can obtain better test accuracy

Fig. 5

t-SNE visualization of feature alignment. Blue points are source samples, and red points are target samples. Top row: TAMS does not use adversarial cross-domain adaptation (Eq. (6)); Bottom row: TAMS uses adversarial cross-domain adaptation (Eq. (6))

Effect of masked self-distillation training

On UDA tasks, the predicted class distribution on target-domain data may collapse, so we analyze whether our method trains stably. In [17], Sun et al. proposed a self-refinement strategy to avoid collapse. Our method avoids such situations through the self-distillation of the teacher–student framework, where the teacher branch updates its weights from the student network using the Exponential Moving Average (EMA). Following [17], we also use the diversity curve of model predictions on the target domain to analyze the training stability of our method. As shown in Fig. 6a, our method maintains stable training on different transfer tasks. In addition, Fig. 6b, c plot the class-level and patch-level adversarial losses, showing that our method converges well during training.

Fig. 6

Training curve on three tasks (i.e., D \(\rightarrow \) W, A \(\rightarrow \) D and W \(\rightarrow \) A) of Office-31. a Plots of the diversity of model predictions on target domain data [17]; b class-level adversarial loss curve on Eq. (2); c patch-level adversarial loss curve on Eq. (7)

Fig. 7

Parameter sensitivity analysis. When \(\gamma =0.01 \), the candidate sets for \(\alpha = \beta = \delta \) can be employed in \(\{ 0.01,0.1,1\} \) to obtain satisfactory performance. When \(\alpha = \beta = \delta =0.1 \), \(\gamma =0.01 \), three difficult tasks (i.e., A \(\rightarrow \) W, D \(\rightarrow \) A and W \(\rightarrow \) A) can obtain the best performance

Fig. 8

Attention maps of images on the Office-31 dataset. a D \(\rightarrow \) A; b W \(\rightarrow \) A. The hotter the color, the higher the attention

Parameter sensitivity analysis

We analyze the parameter sensitivity of TAMS by conducting experiments on the Office-31 dataset. The parameters \(\alpha \), \(\beta \), \(\gamma \) and \(\delta \) were searched in \(\{ 0.01,0.1,1\} \). For simplicity, we set \(\alpha = \beta = \delta \) to use the same weight for the three components and analyze the influence of different values of \(\gamma \) on the performance. As shown in Fig. 7, when \(\gamma =0.01 \), any value of \(\alpha = \beta = \delta \) in \(\{ 0.01,0.1,1\} \) obtains satisfactory performance. When \(\alpha = \beta = \delta =0.1 \) and \(\gamma =0.01 \), three difficult tasks (i.e., A \(\rightarrow \) W, D \(\rightarrow \) A and W \(\rightarrow \) A) obtain the best performance. Therefore, in our experiments, \(\alpha \), \(\beta \) and \(\delta \) are set to 0.1, while \(\gamma \) is set to 0.01. In addition, when \(\gamma =0.1 \), there is no significant decrease in test accuracy, which implicitly indicates that our method is not sensitive to this parameter.

Attention visualization analysis

We visualize the attention maps using Grad-CAM [55] to verify that our model can capture important local regions; Grad-CAM uses the gradients of a target concept flowing into the final layer to produce a coarse localization map highlighting the image regions important for predicting that concept. We randomly sample Office-31 images from two of the more difficult tasks (i.e., \({\text {D}} \rightarrow {\text {A}} \) and \({\text {W}} \rightarrow {\text {A}} \)) to visualize the attention maps. As shown in Fig. 8, the proposed TAMS captures more accurate regions than the baseline. For instance, for the keyboard and the cup, TAMS has more hot areas on the target object than the baseline method. In addition, we also compare TAMS without entropy weighting and with entropy weighting. It can be seen that TAMS with weighting focuses more attention on the hot regions, which promotes the transferability of ViT in UDA tasks.

Conclusion

In this paper, we propose TAMS, a transferable adversarial masked self-distillation method, to improve the performance of UDA; it consists of three parts: an adversarial masked self-distillation objective, a masked image modeling objective, and an adversarial cross-domain adaptation objective. The proposed TAMS simultaneously takes class-level, pixel-level, and patch-level representations into account. Experimental results show that the proposed TAMS outperforms existing state-of-the-art ResNet-based and ViT-based methods on three benchmark datasets. In future work, we will further apply TAMS to other computer vision tasks such as object detection and semantic segmentation.