Transferable adversarial masked self-distillation for unsupervised domain adaptation

Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to a related unlabeled target domain. Most existing works focus on minimizing the domain discrepancy to learn a globally domain-invariant representation using CNN-based architectures, while ignoring transferable and discriminative local representations, e.g., pixel-level and patch-level representations. In this paper, we propose Transferable Adversarial Masked Self-distillation (TAMS), built on the Vision Transformer architecture, to enhance the transferability of UDA. Specifically, TAMS jointly optimizes three objectives to learn both a task-specific class-level global representation and domain-specific local representations. First, we introduce an adversarial masked self-distillation objective that distills the representation of a full image into the representation predicted from a masked image, which aims to learn a task-specific global class-level representation. Second, we introduce a masked image modeling objective to learn a local pixel-level representation. Third, we introduce an adversarial weighted cross-domain adaptation objective to capture the discriminative potential of patch tokens, which aims to learn a transferable and discriminative domain-specific patch-level representation. Extensive experiments on three benchmarks show that our proposed method achieves remarkable improvements over previous state-of-the-art UDA methods.


Introduction
Deep neural networks (DNNs) have shown remarkable achievements in a wide range of computer vision problems [1,2]. However, this impressive success relies heavily on large amounts of labeled training data, and DNNs still generalize poorly to emerging application domains because of the domain shift problem [3,4]. To handle these problems, many unsupervised domain adaptation (UDA) methods [5,6] have been proposed to transfer knowledge from a labeled source domain to a different unlabeled target domain; they can be divided into two main categories: domain alignment methods [7][8][9] and adversarial learning methods [10].
First, from the global class-level representation aspect, masked image modeling [19], which trains the model to predict missing information from masked image patches, has recently shown very strong representation learning performance. We follow ViT [15] to divide an image into regular non-overlapping patches. As shown in Fig. 1, given an image, we randomly mask some proportion of image patches and replace them with a special mask embedding [MSK], which aims to learn local pixel-level representation at masked positions using the missing information of contextual patches. In addition, a [CLS] token is a special symbol added to extract global information, which is helpful for classifier prediction. Specifically, we extend the masked self-distillation framework to integrate the vision transformer into UDA, so the [CLS] embedding is taken to learn discriminative, global task-specific class predictions for UDA.
Second, from the local pixel-level representation aspect, most existing UDA methods ignore another source of supervision available in the raw target domain images. We follow SimMIM [19] and simply predict the RGB pixel values of masked patches, which aims to learn local pixel-level representation at masked positions using the missing information of contextual patches.
Third, from the local patch-level representation aspect, we aim to bridge the two sources of gap (i.e., cross-domain adaptation alignment) so that the local [MSK] features can improve the global [CLS] feature for better classification performance. We therefore introduce an adversarial weighted cross-domain adaptation objective to capture the discriminative potential of patch tokens and learn a transferable and discriminative domain-specific patch-level representation.
Based on these three aspects, in this paper, we propose a novel UDA solution named TAMS (Transferable Adversarial Masked Self-distillation for UDA). As shown in Fig. 1, TAMS takes a vision transformer as the backbone network and utilizes a self-distillation framework for unsupervised domain adaptation. Different from existing ViT-based methods [16][17][18], TAMS jointly optimizes three key designs that take both global and local representations into account, which effectively exploits the transferability of ViT to improve UDA performance.
In summary, the major contributions of this work are: 1. We present a novel masked self-distillation framework for UDA, called TAMS. Specifically, TAMS distills the representation of a full image into the representation predicted from a masked image, which effectively introduces the masked self-distillation of ViT into transferring knowledge on the UDA task.

Related work
Unsupervised domain adaptation Traditional machine learning assumes that training and test data are drawn from the same distribution. However, in practical application scenarios, the target data usually follow a different distribution from the training source data. Thus, transfer learning [20] has been used to transfer generalized knowledge across domains with different distributions, i.e., unsupervised domain adaptation (UDA), where no labels are available for the target domain. Existing UDA methods can be roughly divided into two categories: domain-level methods [6,8,[21][22][23] and class-level methods [24][25][26][27]. Domain-level methods aim to align the distributions of the source and target domains using measures such as Maximum Mean Discrepancy (MMD) [7,8,28] and Correlation Alignment (CORAL) [29,30]. Another line of effort targets fine-grained class-level label distribution alignment by adversarial learning [31], which focuses on learning domain-invariant representations with a feature extractor and a domain discriminator [32,33]. Unlike coarse-grained domain-level alignment, class-level alignment matches each category distribution between the source and target domain data. Different from these two families of methods, our method adopts ViT to simultaneously take class-level, patch-level and pixel-level representations into account, which exploits fine-grained alignment on the Transformer via the adversarial cross-domain adaptation objective.
Vision transformer for UDA Many Vision Transformer (ViT) [15] variants have been applied successfully to various vision tasks such as image classification, object detection and segmentation, modeling long-range dependencies among visual features with the self-attention mechanism. To improve UDA performance, many ViT-based methods have been proposed. For example, Sun et al. [17] proposed the transformer-based self-refinement method SSRT to refine the domain adaptation model. Yang et al. [16] proposed a transferable vision transformer method to investigate the transferability of ViT. Xu et al. [18] explored a cross-domain transformer for unsupervised domain adaptation. However, these methods neglect to take full advantage of different complementary sources of supervision, e.g., masked image modeling [19]. In our method, we introduce masked image modeling into UDA and present a novel masked self-distillation framework that jointly optimizes three objectives to transfer knowledge for UDA tasks.
Fig. 1 The overall framework of TAMS. We introduce a masked self-distillation teacher-student network to learn the latent representation for the target domain, where the student branch consists of an encoder-decoder architecture fed with masked images, while the teacher branch contains an encoder that produces the latent representation and updates its weights from the student network using EMA. We randomly replace image patches with [MSK] tokens on target images, where a mask token [MSK] is taken to learn local pixel-level representation at masked positions using the missing information of contextual patches, while a [CLS] token is a special symbol added to extract global information.

The proposed method
We first give the problem formulation in "Problem formulation", and then introduce the proposed method TAMS, which jointly optimizes three objectives to transfer knowledge for the UDA task: (1) an adversarial masked self-distillation objective to learn task-specific class-level global representation; (2) a masked image modeling objective to learn local pixel-level representation; (3) an adversarial weighted cross-domain adaptation objective to learn domain-specific patch-level representation.
The overall framework has been shown in Fig. 1.

Problem formulation
In UDA, there is a source domain with labeled data $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ drawn from $\mathcal{X} \times \mathcal{Y}$ and a target domain with unlabeled data $\mathcal{D}_t = \{x_i^t\}_{i=1}^{n_t}$ drawn from $\mathcal{X}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. We employ a ViT encoder G_f for feature learning and a classifier G_c for classification. We can derive the cross-entropy loss for the source domain:

$$L_{\mathrm{cls}}(x^s, y^s) = \frac{1}{n_s} \sum_{i=1}^{n_s} L_{ce}\big(G_c(G_f(x_i^s)), y_i^s\big), \qquad (1)$$

where $L_{ce}$ is the standard cross-entropy loss.
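For concreteness, the source classification term can be sketched in NumPy as follows; this is a minimal illustration of the cross-entropy in Eq. (1), and the helper names, toy logits and shapes are ours rather than from the paper:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def source_cls_loss(logits, labels):
    # Mean cross-entropy over a labeled source batch: logits stand in for
    # the classifier outputs G_c(G_f(x^s)) and labels for y^s.
    probs = softmax(logits)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

# Toy batch: two source images, three classes.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.0, 3.0, 0.5]])
labels = np.array([0, 1])
loss = source_cls_loss(logits, labels)
```

In a real implementation the logits come from the ViT [CLS] feature passed through the classification head; here they are hard-coded only to exercise the loss.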

Adversarial masked self-distillation
Following the typical adversarial adaptation method [31], let D_g be a domain discriminator for global feature alignment, applied to the output of the [CLS] token for both the source and target domains. For adversarial domain adaptation [32,33], G_f and D_g play a minimax game: G_f tries to produce domain-invariant features to fool the discriminator, while D_g tries to discriminate source-domain features from target-domain features. The adversarial objective can be formulated as:

$$L_{\mathrm{adv}}(x^s, x^t) = -\frac{1}{n_s + n_t} \sum_{x_i^* \in \mathcal{D}_s \cup \mathcal{D}_t} L_{ce}\big(D_g(G_f(x_i^*)), y_i^d\big), \qquad (2)$$

where $L_{ce}$ is the cross-entropy loss, the superscript $*$ can be either the source domain $s$ or the target domain $t$, and $y_i^d = 1$ denotes the source domain label while $y_i^d = 0$ denotes the target domain label.
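The discriminator side of this minimax game can be sketched as below; D_g is modeled here as a hypothetical logistic discriminator on [CLS] features (the paper does not specify its architecture), and in practice G_f would be trained against this loss through a gradient reversal layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_adv_loss(cls_feat_src, cls_feat_tgt, w_d):
    # Binary cross-entropy for a linear logistic stand-in for D_g:
    # it should output 1 for source [CLS] features and 0 for target ones.
    # D_g minimizes this loss; G_f is updated to maximize it.
    p_src = sigmoid(cls_feat_src @ w_d)   # D_g(G_f(x^s))
    p_tgt = sigmoid(cls_feat_tgt @ w_d)   # D_g(G_f(x^t))
    return float(-np.mean(np.log(p_src + 1e-12))
                 - np.mean(np.log(1.0 - p_tgt + 1e-12)))

# Toy 4-dim [CLS] features for two source and two target images.
rng = np.random.default_rng(0)
f_src = rng.normal(size=(2, 4))
f_tgt = rng.normal(size=(2, 4))
w_d = rng.normal(size=4)
loss = global_adv_loss(f_src, f_tgt, w_d)
```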
To leverage more complementary information, we introduce self-distillation [34,35] to fully utilize the model's previous knowledge to drive its own training. This idea has also been used to solve UDA tasks, e.g., pseudo-label learning [36] and self-ensemble learning [37]. Here, we introduce a masked self-distillation teacher-student network to produce the latent representation for the target domain, where the student branch consists of an encoder-decoder architecture fed with masked images, while the teacher branch contains an encoder that produces the latent representation and updates its weights $\theta'$ from the student network parameters $\theta$ using the Exponential Moving Average (EMA) [38], i.e., $\theta' \leftarrow \mu \theta' + (1-\mu)\theta$. Following FixMatch [39], we set $\mu$ to 0.99, and we introduce a self-distillation loss that only retains target pseudo-labels whose maximum scores are above a threshold $\lambda$ [37]:

$$L_{\mathrm{self\text{-}KD}}(x^t; \theta) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\big(\max(q_b) \ge \lambda\big)\, L_{ce}(p_b, \hat{q}_b), \qquad (3)$$

where $p_b$ is the prediction of the student branch, $q_b$ is the prediction of the teacher branch, $\hat{q}_b = \arg\max q_b$, and $B$ is the batch size on the unlabeled target domain. In our experiments, we set $\lambda$ to 0.5. We can then derive the adversarial masked self-distillation objective as

$$L_{\mathrm{mask\text{-}KD}}(x^s, x^t) = L_{\mathrm{adv}}(x^s, x^t) + L_{\mathrm{self\text{-}KD}}(x^t; \theta), \qquad (4)$$

where $L_{\mathrm{mask\text{-}KD}}$ not only exploits the model's own complementary information, but also learns task-specific global class-level representation.
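The EMA teacher update and the confidence-thresholded self-distillation loss of Eq. (3) can be sketched as follows (a minimal NumPy illustration: the parameter dictionaries and toy probabilities are our own assumptions, and the branches are assumed to already output class probabilities):

```python
import numpy as np

def ema_update(teacher, student, mu=0.99):
    # Teacher parameters track the student via an exponential moving average.
    return {k: mu * teacher[k] + (1.0 - mu) * student[k] for k in teacher}

def self_distill_loss(p_student, q_teacher, lam=0.5):
    # p_student, q_teacher: (B, C) class probabilities from the two branches.
    # Only samples whose teacher confidence exceeds the threshold lam
    # contribute; the pseudo-label is the teacher's argmax.
    B = len(q_teacher)
    conf = q_teacher.max(axis=1)
    pseudo = q_teacher.argmax(axis=1)
    keep = conf >= lam
    ce = -np.log(p_student[np.arange(B), pseudo] + 1e-12)
    return float((ce * keep).sum() / B)

p_s = np.array([[0.7, 0.3], [0.4, 0.6]])    # student predictions
q_t = np.array([[0.9, 0.1], [0.45, 0.55]])  # teacher predictions
loss = self_distill_loss(p_s, q_t, lam=0.6)  # second sample falls below lam
```

Note that the rejected sample still counts in the denominator B, matching the averaging in Eq. (3).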

Masked image modeling
Although Eq. (4) can learn task-specific global class-level representation, Eq. (3) may introduce noisy pseudo-labels into training. Thus we introduce another source of complementary information from the raw image to learn local pixel-level representation. We follow SimMIM [19] and predict the RGB pixel values of masked patches. Specifically, given the output embedding $z_b^m$ of the $m$-th [MSK] token, we first feed it into a linear decoder head to generate the predicted RGB values $y_b^m \in \mathbb{R}^K$ for the patch, where $K$ denotes the number of RGB pixels per patch. The masked image modeling objective can then be formulated as

$$L_{\mathrm{MIM}}(x^t; \theta) = \frac{1}{B M K} \sum_{b=1}^{B} \sum_{m=1}^{M} \big\| y_b^m - x_b^m \big\|_1, \qquad (5)$$

where $M$ denotes the number of masked patches per image and $x_b^m$ denotes the ground-truth RGB values. In our experiments, we also explore the influence of different masking ratios ε in "Experiments".
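A SimMIM-style reconstruction loss restricted to masked patches, as in Eq. (5), can be sketched as below; shapes and the toy values are our own illustration:

```python
import numpy as np

def mim_loss(pred, target, mask):
    # pred, target: (B, N, K) predicted / ground-truth RGB values per patch;
    # mask: (B, N) boolean, True where the patch was replaced by [MSK].
    # The l1 loss is averaged over masked pixels only, so unmasked
    # patches contribute nothing to the objective.
    diff = np.abs(pred - target) * mask[..., None]
    denom = mask.sum() * pred.shape[-1]
    return float(diff.sum() / max(denom, 1))

# Toy example: 1 image, 2 patches of K=3 pixels, only patch 0 masked.
pred = np.array([[[0.5, 0.5, 0.5], [9.0, 9.0, 9.0]]])
target = np.array([[[0.0, 1.0, 0.5], [0.0, 0.0, 0.0]]])
mask = np.array([[True, False]])
loss = mim_loss(pred, target, mask)  # only the masked patch contributes
```

The wildly wrong prediction on the unmasked patch does not change the loss, which is exactly the property the masked objective relies on.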

Adversarial weighted cross-domain adaptation
Let (H, W) denote the resolution of the original image, C the number of channels, and (P, P) the resolution of each image patch. The number of patches is N = HW/P^2. For the self-attention of the Transformer, the patches are projected into three matrices, i.e., queries $Q \in \mathbb{R}^{N \times d}$, keys $K \in \mathbb{R}^{N \times d}$ and values $V \in \mathbb{R}^{N \times d}$, which emphasize the relationships among patches by computing inner products. Different from traditional self-attention, inspired by [16,18], as shown in Fig. 2, we leverage weighted cross-attention to learn mixed feature representations for both the source and target domains, which can be formulated as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) \odot H\big(D_p(K_{\mathrm{patch}})\big)\, V, \qquad (6)$$

where D_p is a patch-level domain discriminator introduced to match cross-domain local features and H(·) denotes the entropy. Here, instead of using the masked target vectors, we use the complete target vectors $Q, K, V \in \mathbb{R}^{N \times d}$ to prevent pixel loss and thus achieve cross-attention. By using the entropy $H(D_p(K_{\mathrm{patch}}))$ to assign weights to different patches, the cross-attention in Eq. (6) not only considers semantic importance ($\mathrm{softmax}(QK^\top/\sqrt{d})$) but also the transferability of each patch token.

As shown in Eq. (2), we can introduce the adversarial cross-domain adaptation objective as

$$L_{\mathrm{patch}}(x^s, x^t) = -\frac{1}{n_s + n_t} \sum_{x_i^* \in \mathcal{D}_s \cup \mathcal{D}_t} \frac{1}{P} \sum_{p=1}^{P} L_{ce}\big(D_p(f_{i,p}^*), y_i^d\big), \qquad (7)$$

where P is the number of patches and $f_{i,p}^*$ is the p-th patch feature of image $x_i^*$. Following adversarial learning, D_p tries to assign 1 to a source-domain patch and 0 to a target-domain patch, while G_f combats such discrimination. As a result, adversarial cross-domain adaptation aggregates the two input images based on the different weights of patches, which aligns the information at the patch-level representation.

Following the literature [16,40,41], we also maximize the mutual information on the target label distribution to avoid every target sample being assigned to the same class:

$$I(p^t; x^t) = H(\bar{p}^t) - \frac{1}{n_t} \sum_{i=1}^{n_t} H(p_i^t), \qquad (8)$$

where $p_i^t = G_c(G_f(x_i^t))$ is the predicted class distribution of target sample $x_i^t$ and $\bar{p}^t$ is its mean over the target domain. By maximizing mutual information, our model can further improve generalization to the target domain.
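One plausible reading of the entropy-weighted cross-attention described above can be sketched as follows; the exact placement of the weighting in TAMS may differ, so treat this as an assumption-laden illustration in which a patch whose discriminator output is near 0.5 (maximal entropy, i.e., hard to tell apart by domain) is up-weighted as most transferable:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def binary_entropy(p):
    # Entropy of the patch discriminator output D_p(K_patch); highest when
    # D_p cannot tell the domain, i.e. for the most transferable patches.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def weighted_cross_attention(Q, K, V, d_probs):
    # Assumed form of Eq. (6): the usual attention map is rescaled,
    # per key patch, by the entropy of the patch-level discriminator.
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (N, N) semantic importance
    w = binary_entropy(d_probs)                      # (N,) transferability
    return (attn * w[None, :]) @ V

N, d = 4, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
d_probs = np.array([0.5, 0.9, 0.1, 0.5])  # patches 0 and 3 most transferable
out = weighted_cross_attention(Q, K, V, d_probs)
```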
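The mutual information term of Eq. (8) can be sketched as below; it is high when individual target predictions are confident while the classes are used evenly across the target domain, and zero when every sample collapses onto one class:

```python
import numpy as np

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def mutual_info(probs):
    # probs: (n_t, C) predicted class distributions p_i^t on the target.
    # I(p^t; x^t) = H(mean prediction) - mean per-sample entropy.
    return float(entropy(probs.mean(axis=0)) - entropy(probs, axis=-1).mean())

# Confident and diverse predictions -> mutual information close to log C.
probs = np.array([[1.0, 0.0], [0.0, 1.0]])
mi = mutual_info(probs)

# Collapsed predictions (everything in one class) -> mutual information 0.
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
mi_zero = mutual_info(collapsed)
```

Maximizing this quantity therefore pushes the model away from the degenerate solution in which every target sample receives the same label.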
To summarize, the objective function of TAMS can be defined as

$$L_{\mathrm{TAMS}} = L_{\mathrm{cls}}(x^s, y^s) + L_{\mathrm{adv}}(x^s, x^t) + \alpha L_{\mathrm{self\text{-}KD}}(x^t; \theta) + \beta L_{\mathrm{MIM}}(x^t; \theta) + \gamma L_{\mathrm{patch}}(x^s, x^t) - \delta I(p^t; x^t), \qquad (9)$$

where α, β, γ and δ are hyper-parameters. In our experiments, α, β and δ are set to 0.1, while γ is set to 0.01.

Experiments
In this section, we evaluate and analyze the proposed TAMS methods on benchmark datasets.
Office-31 [42] contains 4,652 images of 31 classes from three domains: Amazon (A), DSLR (D), and Webcam (W). As shown in Table 1, 'A → W' denotes the transfer task with A as the source domain and W as the target domain. Office-Home [43] consists of 15,500 images of 65 classes from four domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-world (Rw) images.
VisDA-2017 [13] is a Synthetic-to-Real dataset, which contains about 0.2 million images in 12 classes.

Implementation details
In our experiments, we use ViT-small (ViT-S) and ViT-base (ViT-B) with 16 × 16 patch size [15], pre-trained on ImageNet, as the vision transformer backbones; both contain 12 transformer layers in the encoder. ViT-S (22 M parameters) is comparable in size to ResNet-50. The 'Baseline' indicates training ViT-S/ViT-B with the adversarial domain adaptation method alone (i.e., Eq. (2)). We train the ViT-S/ViT-B models using a mini-batch SGD optimizer with a momentum of 0.9, initializing the learning rate at 0 and linearly increasing it to lr = 0.03 over the first 500 training steps, except for D → A and W → A on the Office-31 dataset, where lr = 0.003. In addition, we set the masking ratio ε to 0.4 for masked image modeling.
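The linear warmup described above can be sketched as a tiny schedule function; the behavior after step 500 is held constant here, since this section does not specify any later decay:

```python
def lr_at_step(step, lr_max=0.03, warmup_steps=500):
    # Linear warmup from 0 to lr_max over the first warmup_steps steps;
    # after warmup the rate is simply held flat (our assumption).
    return lr_max * min(step / warmup_steps, 1.0)

schedule = [lr_at_step(s) for s in (0, 250, 500, 1000)]
```

For the harder D → A and W → A tasks, the same function would be called with lr_max=0.003.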

Experimental results
The detailed experimental results are shown in Tables 1, 2 and 3. We can observe that transformer-based UDA methods perform better than ResNet-based UDA models, which shows that the transformer is more powerful than ResNet on UDA tasks. Detailed experimental analyses for the different datasets are given below.

Office-31 results
As shown in Table 1, the proposed TAMS outperforms the other ResNet-based UDA models and obtains better performance than the other ViT-based UDA methods (the best results in the table are shown in boldface). For example, on some transfer tasks (e.g., D → A and W → A), the proposed TAMS performs better than CDTrans [18], TVT [16] and SSRT-B [17].

Office-Home results
As shown in Table 2, our method achieves the highest average accuracy of 85.63%. Compared with the ResNet-based models, the ViT-based models bring a larger improvement. In particular, the proposed TAMS significantly improves the performance on all transfer tasks. Compared with SSRT-B [17], our method performs better on some transfer tasks (e.g., Cl → Pr and Rw → Ar). When transferring to the Cl domain, many methods have lower accuracy (e.g., CDTrans [18] and TVT [16]), while our method achieves better performance, which implicitly indicates that our method has better generalization ability across different transfer tasks.

VisDA-2017 results
As shown in Table 3, our method achieves an average accuracy of 88.4%. Compared with the other methods, TAMS achieves the best performance on three classes: bicycle, horse, and skateboard.

Ablation study
We conduct ablation studies to verify the effects of the different loss functions, including L cls (x s , y s ), L adv (x s , x t ), L self-KD (x t ; θ), L MIM (x t ; θ), L patch (x s , x t ) and I ( p t ; x t ).
We also explore the effect of different ViT encoders, comparing traditional self-attention with the weighted cross-attention of Eq. (6). Comparing the second, third and fourth rows in Table 4, we can see that the self-distillation loss and masked image modeling are effective for ViT backbones on UDA tasks. Comparing the fourth and fifth rows, the patch-level adversarial loss further improves performance on the Office-31 dataset. In addition, the proposed TAMS with the cross-attention of Eq. (6) achieves better performance than traditional self-attention in ViT, which demonstrates the effectiveness of the weighted cross-attention mechanism used in TAMS.

Diversity analysis with masked image modeling
We examine the attention diversity of heads and observe whether masked image modeling (MIM) helps to improve the transferable performance of UDA. Following [15], we use the average attention distance to measure whether attention is local or global, which partially reflects the receptive field size of each attention head. Figure 3 shows the average attention distance per head and layer depth using the ViT-B/16 architecture. We visualize the average attention distance with and without MIM. It can be seen that: (1) Without MIM (the top row), the average attention distances of different heads in deeper layers collapse into a very small distance range. This suggests that different heads learn very similar visual cues and may be wasting model capacity. (2) With MIM (the bottom row), the attention representations become more diverse in terms of average attention distance, especially in deeper layers. As analyzed in [52], masked image modeling brings a locality inductive bias to the trained model and more diversity across attention heads, which is useful for enhancing the transferable performance of UDA.
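The average attention distance metric can be sketched as below (our own minimal implementation of the idea from [15]: for each query patch, the attention-weighted mean pixel distance to all key patches, averaged over queries):

```python
import numpy as np

def avg_attention_distance(attn, coords):
    # attn: (N, N) attention weights of one head (rows sum to 1);
    # coords: (N, 2) patch-center coordinates in pixels.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((attn * dist).sum(axis=1).mean())

# 2x2 grid of 16x16 patches with perfectly uniform attention.
coords = np.array([[0, 0], [16, 0], [0, 16], [16, 16]], dtype=float)
attn = np.full((4, 4), 0.25)
d = avg_attention_distance(attn, coords)
```

A head attending only to its own patch would score 0, while a head spreading attention across the whole grid scores close to the mean inter-patch distance, which is how the local/global distinction in Fig. 3 is quantified.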

Effect of different masking ratio
As shown in Fig. 4, we analyze the influence of different masking ratios ε on Office-31. When ε = 0.4, the model obtains better test accuracy, which implicitly shows that making full use of pixel-level supervision in the target domain is beneficial for enhancing the transferable performance of UDA.

Effect of different masking strategies
For UDA tasks, we adopt a simple random masking strategy. Furthermore, we also study how different masking strategies affect the effectiveness of UDA. In our experiments, we try other masking strategies (i.e., square [53], block-wise [54], and random) with different masked patch sizes (i.e., 16 and 32). The detailed experimental results are shown in Table 5. We first notice that the best test accuracy of our simple random masking strategy reaches 94.18%, which is 0.12% higher than the best block-wise masking strategy. In addition, when a large masked patch size of 32 is adopted, the different masking strategies perform stably within a small range of accuracy.
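The simple random strategy can be sketched as follows (the grid size 196 assumes a 224×224 image with 16×16 patches; the helper name is ours):

```python
import numpy as np

def random_mask(num_patches, ratio, rng):
    # Mask a fixed fraction of patch positions chosen uniformly at random,
    # as in the simple random strategy used above.
    n_mask = int(num_patches * ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:n_mask]] = True
    return mask

rng = np.random.default_rng(0)
mask = random_mask(196, 0.4, rng)  # 14x14 patch grid, masking ratio 0.4
```

Square and block-wise strategies differ only in how the masked indices are chosen (a contiguous square region, or several contiguous blocks) rather than in this overall interface.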

Effect of adversarial cross-domain adaptation
To further verify the effectiveness of adversarial cross-domain adaptation (ACA), we use t-SNE to visualize the feature representations with and without ACA, as shown in Fig. 5. Blue and red points represent source domain and target domain samples, respectively. It can be seen that without ACA, the features are not well aligned on tasks such as A → W and W → A. When using ACA, TAMS achieves good feature alignment; as shown for W → A, our method not only produces compact intra-class representations and separable inter-class representations, but also aligns the feature distributions more effectively than without ACA. This shows that adversarial cross-domain adaptation can capture discriminative information and achieve better alignment.

Parameter sensitivity analysis
We analyze the parameter sensitivity of TAMS by conducting experiments on the Office-31 datasets. The parameters α, β, γ and δ were searched in {0.01, 0.1, 1}. For simplicity, we set α = β = δ so that the three components take the same weight, and analyze the influence of different values of γ on performance. As shown in Fig. 7, the best performance is obtained when α, β and δ are set to 0.1, while γ is set to 0.01. In addition, when γ = 0.1, there is no significant decrease in test accuracy, which implicitly indicates that our method is not parameter-sensitive.

Attention visualization analysis
We visualize the attention maps using Grad-CAM [55] to verify that our model can capture important local regions, where Grad-CAM uses the gradients of a target concept flowing into the final layer to produce a coarse localization map highlighting the regions in the image that are important for predicting the concept. We randomly sample Office-31 images from the two more difficult tasks (i.e., D → A and W → A) to visualize the attention maps. As shown in Fig. 8, the proposed TAMS captures more accurate regions than the baseline. For instance, for the keyboard and cup, TAMS shows more hot areas on the target object than the baseline method.
In addition, we also compare TAMS without the entropy-based weights and with them. It can be seen that TAMS with the weights focuses more attention on the hot regions, which promotes the transferability of ViT in UDA tasks.

Conclusion
In this paper, we propose the transferable adversarial masked self-distillation method to improve the performance of UDA, which consists of three parts: the adversarial masked self-distillation, masked image modeling, and adversarial cross-domain adaptation objectives. The proposed TAMS simultaneously takes class-level, pixel-level, and patch-level representations into account. Experimental results show that the proposed TAMS outperforms existing state-of-the-art ResNet-based and ViT-based methods on three benchmark datasets. In future work, we will further apply TAMS to other computer vision tasks such as object detection and semantic segmentation.
Fig. 1 (continued) We jointly optimize three objectives to transfer knowledge for the UDA task: (1) the adversarial masked self-distillation objective ("Adversarial masked self-distillation") to learn task-specific class-level global representation; (2) the masked image modeling objective ("Masked image modeling") to learn local pixel-level representation; (3) the adversarial weighted cross-domain adaptation objective ("Adversarial weighted cross-domain adaptation") to learn domain-specific patch-level representation. Note that the labeled data in the source domain are used to train the student branch, which has been omitted in the figure

Fig. 2
Fig. 2 The weighted cross-attention for the Transformer. Using the entropy H(D_p(K_patch)) to assign weights to different patches, the cross-attention mechanism (Eq. (6)) can not only capture semantic importance (softmax(QK^⊤/√d)) on the attention map, but also the transferability of each patch token

Fig. 3 Fig. 4
Fig. 3 The average attention distance in different attention heads (dots).Top row: TAMS does not use masked image modeling (MIM); Bottom row: TAMS uses masked image modeling (MIM)

Fig. 6 Fig. 7
Fig. 6 Training curves on three tasks (i.e., D → W, A → D and W → A) of Office-31. a Plots of the diversity of model predictions on target domain data [17]; b class-level adversarial loss curve of Eq. (2); c patch-level adversarial loss curve of Eq. (7)

Fig. 8
Fig. 8 Attention maps of images on Office-31 dataset.a D → A; b W → A. The hotter the color, the higher the attention

Table 4
Ablation study of each module