1 Introduction

Inspired by its success in natural language processing, e.g., BERT [1] and GPT [2, 3], the transformer is becoming increasingly popular in various image processing tasks, including recognition [4, 5], object detection [6], and video processing [7, 8].

Notably, Dosovitskiy et al. [4] propose the Vision Transformer (ViT), which adapts a pure transformer for image classification, and show that it achieves performance comparable or superior to that of conventional convolutional neural networks (CNNs). Follow-up studies also show that ViT is more robust against common corruptions and perturbations than convolution-based models (e.g., ResNet) [9, 10], which is an important property for safety-critical applications.

This paper seeks to answer the following question: can we further improve the robustness of ViT without retraining it from scratch? Using heavy data augmentation during training is a natural way to improve robustness, and prior works demonstrate that several data augmentation methods can indeed improve the robustness of CNNs [11, 12]. Another study empirically shows that a sharpness-aware optimizer improves the robustness of ViT [13]. However, retraining ViT from scratch is undesirable because of its huge computational cost. Moreover, the dataset used for pretraining is sometimes not publicly available, making it impossible to retrain the model.

Test-time adaptation (TTA) is a recently proposed approach for improving the robustness of a model [14,15,16]. In test-time adaptation, a model corrects its predictions on test data at test time, without looking at the labels, by modulating a small portion of the model’s parameters [typically the parameters of Batch Normalization (BN) and/or its statistics]. For example, Wang et al. [16] propose Tent, which modulates the BN parameters by minimizing prediction entropy, and show that it can significantly improve the robustness of ResNet. This approach is favorable for our case, as it does not alter the training phase and thus does not require repeating the computationally heavy training of ViT. Recently, Kojima et al. [17] have shown that the existing representative TTA methods, including Tent, are also effective on ViT when they modulate parameters in ViT. Specifically, the study demonstrated that, for each TTA method, updating only the affine transformation parameters within the layer-normalization layers of ViT boosts the performance on target datasets, including common corruptions and domain shift. Besides the parameter choice, Kojima et al. [17] found that gradient clipping [18], which is not used in the original Tent paper [16], is essential for applying Tent to ViT to mitigate catastrophic failure. However, we observed that just minimizing the prediction entropy of ViT, as Tent does, still often causes catastrophic failure, especially under severe distribution shifts.

In this paper, we design a new loss function, called Attent, to stabilize the test-time adaptation of ViT. Attent adapts to target data by minimizing the distributional difference of attention entropy between source data and target data. Optimization is performed for each layer, head, and token on the target data to bring their distributions back to those of the source data. The distributional statistics of attention entropy on the source dataset are calculated and stored in memory beforehand. Therefore, this approach does not require the source dataset during adaptation.

In summary, our main contributions are as follows.

  • This paper proposes a new test-time adaptation method for ViT. To mitigate the catastrophic failure of an existing TTA method (Tent) on ViT, we introduce a new loss function called Attent. Attent minimizes the distributional differences of the attention entropy between the source and target in an online manner.

  • Using multiple standard datasets that benchmark robustness against common corruptions and perturbations, namely CIFAR-10-C, CIFAR-100-C, and ImageNet-C, we validate that the robustness of ViT (ViT-B16 and ViT-L16) is improved by Attent without retraining the model from scratch.

  • We show that the robustness is further improved by combining Attent with other TTA methods that do not require source information, e.g., Tent or SHOT-IM. The improvement is especially significant for more severe corruptions from which Tent alone cannot recover. In addition, Attent is less sensitive to hyperparameters, which is a favorable property in practical settings.

2 Related Works

2.1 Robustness of ViT

2.1.1 Transformer Architecture

Models based on the Transformer architecture [19] achieve great performance not only in NLP but also in image processing, as with the Vision Transformer [4]. Self-attention is one of the building blocks of the Transformer. Let \(z^l{\in }\mathbb {R}^{T \times D}\) be the input to the \(l\)th self-attention layer, where T is the number of tokens and D is the number of features in each token. For each layer l and head h, the attention block takes \(z^l\) as input and computes the attention weight matrix \(A^{lh}{\in }\mathbb {R}^{T{\times }T}\). Specifically, the attention weight between positions i and j, \(A_{ij}^{lh}\), is calculated from the dot product between their respective query \(Q_i^{lh}\) and key \(K_j^{lh}\) representations as follows:

$$\begin{aligned}{}[Q^{lh}, K^{lh}, V^{lh}]&= z^l {W}^{lh} , \, {W}^{lh} \in \mathbb {R}^{D \times 3D_h} \end{aligned}$$
(1)
$$\begin{aligned} A^{lh}&= \textrm{softmax} \left( \frac{Q^{lh}{K^{lh}}^T}{\sqrt{D_h}}\right) , \end{aligned}$$
(2)

where \({W}^{lh}\) is a learnable parameter matrix. \(D_h\) is typically set to D/H to keep the number of parameters constant even when the number of heads H is changed.
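As a minimal sketch of Eq. 2 for a single head (not the authors' implementation), the attention weights can be computed from given query and key matrices as follows. The function names and the toy values of Q and K are illustrative assumptions; the projection of Eq. 1 is assumed to have been applied already.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Return A = softmax(Q K^T / sqrt(D_h)) for one head.

    Q and K are T x D_h nested lists (one row per token).
    """
    d_h = len(Q[0])
    scale = math.sqrt(d_h)
    A = []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        A.append(softmax(scores))
    return A

# Toy example with T = 3 tokens and D_h = 2 (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A = attention_weights(Q, K)
```

Each row of A is a probability distribution over the T tokens, which is what makes a per-token attention entropy well defined later on.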

2.1.2 Inherent Robustness of ViT

Recent studies verify experimentally that ViT is already robust without any adaptation or additional data augmentation. Several studies empirically show that ViT is inherently more robust than CNNs on several benchmark datasets [9, 10]. The datasets include ImageNet-C (corruption), ImageNet-P (perturbations), ImageNet-R (semantic shifts), ImageNet-O (out-of-domain distribution), and ImageNet-9 (background dependence) [12, 20,21,22]. Other studies show that the robustness of ViT is further improved by changing the training strategy, such as using a larger dataset in the pretraining phase [9, 23] or a sharpness-aware optimizer in the training phase [13].

2.2 Improving Robustness of CNN

2.2.1 Test-Time Adaptation

Test-time adaptation (TTA) is an online adaptation approach, and our proposal belongs to this category. Test-time adaptation does not require altering the training process, so it is generally applicable to a wide variety of training methods. It adapts the model to the target data at the level of real-time mini-batches to boost accuracy in an online manner. Test-time batch normalization [14, 15] reestimates the statistics of batch normalization [24] on the target dataset. Test-time entropy minimization (Tent) [16] adapts to unlabeled online target data by minimizing the Shannon entropy [25] of the model’s predictions while updating only the affine transformation parameters of batch normalization. Tent was shown to be effective when the model is a CNN with batch normalization, e.g., ResNet50 [26], while a recent study [17] has demonstrated that Tent is also applicable to ViT (see Sect. 3.1 for details). One can also use different loss functions for updating parameters, such as pseudo-label (PL) [27], diversity regularization (SHOT-IM) [28], feature alignment (TFA [29] and CFA [17]), or contrastive learning [30]. A recent study by Iwasawa et al. [31] proposed a gradient-free procedure that updates only the classifier parameters of the model (T3A). Our approach is categorized as a feature alignment approach, which minimizes the statistical distance between the source and target datasets. It is worth noting that feature alignment approaches for test-time adaptation assume access to the statistics of the source dataset during the test phase, but they need neither access to the source dataset itself nor a repetition of the computationally heavy training [17]. Test-time adaptation generally assumes that the model is distributed without source data for bandwidth, privacy, or profit reasons [16]. We argue that the statistics of the source data could be distributed even in such a situation, since doing so drastically compresses the data size and eliminates sensitive information. In fact, some layers often used in typical neural networks (e.g., batch normalization) already contain statistics of the source data [17].

2.2.2 Data Augmentation

Several studies show that adding data augmentation naturally increases the robustness of a model. For example, Hendrycks et al. [11] show that randomly selected noise operations and their compositions can improve the robustness against corruptions (AugMix). Hendrycks et al. [12] propose generating a variety of noise by passing training images through an image-to-image network and introducing several perturbations during the forward pass; the output can then be used as a noisy training dataset (DeepAug). However, changing the data augmentation usually requires retraining the model from scratch, which is computationally heavy for ViT.

2.2.3 Unsupervised Domain Adaptation

Several recent studies use unsupervised domain adaptation (UDA) [32,33,34] to improve the robustness of CNNs. For example, Xie et al. [35] propose Noisy-Student, which trains the model with pseudo-labeled data [27] that come from the target distribution. Similarly, Rusak et al. [36] propose using the pseudo-label technique with a robust classification loss, called Generalized Cross Entropy (GCE) [37], as the objective function. These works demonstrate the usefulness of UDA for robustifying the model; however, this approach must assume the type of corruption in advance, because unlabeled data from the target domain need to be available at training time. Unlike UDA, our approach does not need to assume the corruption type in advance, which is a more practical setting.

2.3 Central Moment Discrepancy

To stabilize the adaptation, our method uses central moment discrepancy (CMD) [38] to measure the discrepancy between source and target. Let X and Y be bounded random samples with respective probability distributions p and q on the interval \([a, b]^N\). Formally, CMD is defined by

$$\begin{aligned} \begin{aligned} CMD_K(X, Y)&= \frac{1}{|b-a|} \Vert \mathbb {E}(X) - \mathbb {E}(Y) \Vert _2\\&\quad + \sum _{k=2}^K \frac{1}{|b-a|^k} \Vert \mathbb {M}_k(X) - \mathbb {M}_k(Y) \Vert _2 , \end{aligned} \end{aligned}$$
(3)

where \(\mathbb {E}(X) = \frac{1}{|X|}\sum _{x \in X} x\) is the empirical expectation vector computed on the sample X. Here, \(|X|\) denotes the total number of samples, following the notation in [38]. \(\mathbb {M}_k(X) = \mathbb {E}((x-\mathbb {E}(X))^k)\) is the vector that consists of the kth-order central moments of each element of X. The same definitions apply to Y. Previous works focus on using CMD in the field of UDA to reduce the distributional gap between source and target representations. However, CMD can also potentially be used for test-time adaptation, because it does not require storing the source dataset itself; instead, we store the central moment statistics of the source data in memory and use them during online adaptation for moment matching between source and target.
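To make the definition concrete, the following is a small sketch of an empirical CMD between two samples, assuming that the kth-order term is scaled by \(1/|b-a|^k\) as in [38]. The helper names (`mean_vec`, `central_moment_vec`, `cmd`) are illustrative and do not come from the original work.

```python
import math

def mean_vec(X):
    """Empirical mean of each coordinate; X is a list of equal-length vectors."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def central_moment_vec(X, k):
    """k-th order central moment of each coordinate of X."""
    mu = mean_vec(X)
    n = len(X)
    return [sum((x[d] - mu[d]) ** k for x in X) / n for d in range(len(mu))]

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cmd(X, Y, a, b, K=3):
    """Empirical CMD up to order K for samples bounded in [a, b]."""
    span = abs(b - a)
    total = l2(mean_vec(X), mean_vec(Y)) / span
    for k in range(2, K + 1):
        total += l2(central_moment_vec(X, k), central_moment_vec(Y, k)) / span ** k
    return total

# Two one-dimensional samples with equal means but different spread.
X = [[0.0], [1.0]]
Y = [[0.5], [0.5]]
distance = cmd(X, Y, a=0.0, b=1.0, K=3)
```

In this toy case the means coincide, so the distance comes entirely from the second-order term. Note that only the moment vectors of one side are needed, which is why the source samples themselves do not have to be kept.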

3 Methodology

3.1 Tent for ViT

Recently, Kojima et al. [17] studied applying existing TTA methods, including Tent, to ViT. Assume that we have a pretrained model whose parameters are denoted by \(\theta \). The model is trained on clean source data and needs to make predictions on data with unknown corruption. Given an input image x, the model outputs the conditional probability \(P(y \mid x;\theta )\). At test time, Tent [16] minimizes the following prediction entropy using stochastic gradient descent (SGD):

$$\begin{aligned} L_{\text {tent}} = \mathbb {E} \left[ \sum _{c=1}^C - P_{c}(x;\theta ) \log P_{c}(x;\theta ) \right] , \end{aligned}$$
(4)

where C is the number of classes and \(P_c(x;\theta )\) is the estimated probability that x belongs to class c.
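As an illustration of Eq. 4, the sketch below computes the mean prediction entropy of a batch from raw logits. It is a simplified stand-in rather than the Tent implementation; the function names and the toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def tent_loss(batch_logits):
    """Mean prediction entropy over a batch (empirical version of Eq. 4)."""
    entropies = [prediction_entropy(softmax(z)) for z in batch_logits]
    return sum(entropies) / len(entropies)

# A maximally uncertain prediction versus a confident one (C = 4).
uniform_loss = tent_loss([[0.0, 0.0, 0.0, 0.0]])
confident_loss = tent_loss([[20.0, 0.0, 0.0, 0.0]])
```

The uniform prediction attains the maximum entropy \(\log C\), while the confident prediction is close to zero. A model that always outputs one class with full confidence also reaches a loss near zero, which is the degenerate solution discussed in Sect. 3.2.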

While updating the entire \(\theta \) is technically possible, it is known to be ineffective in test-time adaptation. A key feature of Tent is that it does not alter all the parameters \(\theta \) but only a small subset \(\psi \subset \theta \). Specifically, Tent [16] updates the parameters of the affine transformation in batch normalization. However, recent large models including ViT do not have batch normalization. Therefore, there is no trivial answer to the question: which parameters should we update when applying Tent to ViT models? According to Wang et al. [16], updating only feature modulations that are linear and low-dimensional leads to stable and efficient adaptation.

Kojima et al. [17] found that, to boost the performance of ViT on target datasets, updating only the affine transformation parameters within the layer-normalization layers of ViT is stably effective for each TTA method. Layer normalization in ViT computes the mean and standard deviation of each input across its own feature dimensions and then applies an affine transformation to each dimension. The notable difference from batch normalization is that layer normalization does not need to calculate the mean and standard deviation across multiple samples, i.e., it automatically reestimates them across dimensions on every single target sample. Therefore, we only need to update the affine transformation parameters in layer normalization for adaptation. In addition, Kojima et al. [17] demonstrated that updating all the parameters yields the best performance improvement when an appropriate loss function, such as SHOT-IM or CFA, is used. We experimentally find the best set of parameters for our case (Attent) in Sect. 4.5.
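The point about layer normalization can be illustrated with a short sketch for a single token vector. The statistics are computed from the input itself, so the only learnable quantities are the per-dimension scale and shift, named `gamma` and `beta` here by assumption; `eps` is likewise an assumed stability constant.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize one feature vector x, then apply the affine transformation.

    The mean and variance come from x alone (no batch statistics), so only
    gamma and beta would need to be adapted at test time.
    """
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0]
identity = layer_norm(x, gamma=[1.0] * 3, beta=[0.0] * 3)
shifted = layer_norm(x, gamma=[2.0] * 3, beta=[0.5] * 3)
```

Because no running statistics are involved, there is nothing to reestimate on the target data beyond these affine parameters, in contrast to batch normalization.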

Besides the parameter choice, Kojima et al. [17] found that gradient clipping [18], which is not used in the original paper [16], is essential for applying Tent to ViT. Specifically, Kojima et al. [17] clip gradients whose norm is greater than 1.0 in all experiments. Whether these techniques are specific to ViT (or huge models) needs further investigation, but without them, Tent often fails catastrophically (a significant drop in classification accuracy). See Sect. 4.7 for details.
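A rough sketch of norm-based gradient clipping is given below. It assumes that the norm is taken jointly over all updated parameters (a global norm); the text above does not state this for the adaptation phase, so it is an assumption, and the function name is illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm

# Gradients with joint norm 5.0 are scaled down to norm 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Clipping preserves the gradient direction and only bounds the step length, which limits the damage a single badly behaved batch can do during online updates.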

3.2 Attent

Fig. 1

Overview of our method (Attent). Similar to Liu et al. [29], our method consists of three stages: model training, offline statistics summarization, and online test-time adaptation. (1) ViT is trained in a supervised manner on a labeled source dataset. (2) After training, the statistics (mean and higher order moments) of attention entropy on the source dataset are calculated and stored in memory as fixed values. (3) At test time, labels are predicted while a subset of the parameters in ViT is updated by matching the distribution of attention entropy between the source and target datasets

Tent is sometimes unstable during adaptation even when we apply the techniques described in Sect. 3.1 (see Table 4 for details). Tent is unstable because its objective function merely minimizes the entropy of the classification output. One extreme solution to this objective is to always assign 100% probability to a single class, which amounts to catastrophic failure of adaptation.

To alleviate this problem, we propose a new approach for test-time adaptation called Attent. In this approach, we focus on the attention mechanism in ViT, which is one of the architectural components critical for correct prediction [9, 39]. We hypothesize that the dot product between Q and K in the attention (Eq. 2) is shifted if ViT takes as input target data whose distribution differs from that of the source data. Consequently, the attention weight distribution is shifted in an unexpected way. Our core idea is to bring this anomalous distribution back to normal by matching the distribution of attention entropy between source and target data (see Fig. 1 for an overview of the method). Simply minimizing attention entropy would fail, because it may lead to paying attention to extremely narrow areas (see Sect. 4.5 for the experimental result). Similar to most prior works, our method uses stochastic gradient descent to adapt the model at test time. Unlike prior methods such as Tent [16], PL [27], and T3A [31], which modulate the parameters using only the data available at test time, our method aligns the statistics of features between source and target. In other words, we leverage the source statistics as auxiliary information about the source distribution to avoid catastrophic failure.

Let L, H, and T be the number of layers, the number of heads, and the number of tokens in the transformer, respectively. Given a sample from the source dataset \(x_n^{\mathcal {S}}{\in }X^{\mathcal {S}}\), the attention entropy can be calculated for each layer \(l{\in }L\), head \(h{\in }H\), and token \(i{\in }T\). Specifically, following the notation of Sect. 2.1.1, we define \(A^{lh}(x_{n}^{\mathcal {S}};\bar{\theta }){\in }\mathbb {R}^{T{\times }T}\) as the attention weight matrix parameterized by \(\bar{\theta }\), where \(\bar{\theta }\) denotes the parameters just after training on the source dataset, i.e., before adaptation. The attention entropy on a source sample is defined as follows:

$$\begin{aligned} \mathcal {H}^{\mathcal {S}n}_{lhi}&= \sum _{j=1}^T - A^{lh}_{ij}(x_{n}^{\mathcal {S}};\bar{\theta }) \log A^{lh}_{ij}(x_{n}^{\mathcal {S}};\bar{\theta }) . \end{aligned}$$
(5)

The mean and kth-order central moments of the entropy for source data are calculated by the following form and stored in memory as fixed values:

$$\begin{aligned} \mathbb {E}(\mathcal {H}^{\mathcal {S}}_{lhi})&= \frac{1}{|X^{\mathcal {S}}|} \sum _{x_n^{\mathcal {S}}{\in }X^{\mathcal {S}}} \mathcal {H}^{\mathcal {S}n}_{lhi} , \end{aligned}$$
(6)
$$\begin{aligned} \mathbb {M}_k(\mathcal {H}^{\mathcal {S}}_{lhi})&= \frac{1}{|X^{\mathcal {S}}|} \sum _{x_n^{\mathcal {S}}{\in }X^{\mathcal {S}}} ( \mathcal {H}^{\mathcal {S}n}_{lhi} - \mathbb {E}(\mathcal {H}^{\mathcal {S}}_{lhi}) )^k, \end{aligned}$$
(7)

where \(k = 2, \ldots ,K\). Attent uses these statistics to adapt the model during test phase.
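The offline summarization step of Eqs. 5 to 7 can be sketched as follows for a single (layer, head, token) index. This is an illustrative simplification with assumed function names; a practical implementation would vectorize over all layers, heads, and tokens.

```python
import math

def attention_entropy(A):
    """Row-wise entropy of a T x T attention matrix (Eq. 5), one value per token i."""
    return [-sum(a * math.log(a) for a in row if a > 0.0) for row in A]

def moments(samples, K=3):
    """Mean and central moments of order 2..K of scalar entropy values.

    samples holds the entropy of one fixed (layer, head, token) over all
    source images (Eqs. 6 and 7).
    """
    n = len(samples)
    mean = sum(samples) / n
    central = {k: sum((s - mean) ** k for s in samples) / n for k in range(2, K + 1)}
    return mean, central

# Uniform attention over T = 4 tokens has the maximum entropy log(4);
# one-hot attention has entropy 0.
uniform = [[0.25] * 4 for _ in range(4)]
one_hot = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
entropy_uniform = attention_entropy(uniform)
entropy_one_hot = attention_entropy(one_hot)

# Statistics of two made-up entropy values for one (l, h, i).
mean, central = moments([1.0, 3.0], K=3)
```

Only the returned mean and moment values need to be stored, so the memory cost is independent of the size of the source dataset.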

During the test-time phase, assume that a sequence of test data drawn from the target distribution arrives at our model one batch after another. The set of test samples in the mth batch is denoted as \(X^{\mathcal {T}}_{m} \subset X^{\mathcal {T}}\), \(m = 1, \ldots ,M\). For each sample in each batch \(x_{mn}^{\mathcal {T}} {\in }X^{\mathcal {T}}_{m}\), the attention entropy on target data is calculated in the same way as for the source data:

$$\begin{aligned} \mathcal {H}^{\mathcal {T}mn}_{lhi}&= \sum _{j=1}^T - A^{lh}_{ij}(x_{mn}^{\mathcal {T}};\theta ) \log A^{lh}_{ij}(x_{mn}^{\mathcal {T}};\theta ). \end{aligned}$$
(8)

The mean and higher order central moments of the entropy are calculated for each batch of target data:

$$\begin{aligned} \mathbb {E}(\mathcal {H}^{\mathcal {T}m}_{lhi})&= \frac{1}{|X_{m}^{\mathcal {T}}|} \sum _{x_{mn}^{\mathcal {T}}{\in }X_{m}^{\mathcal {T}}} \mathcal {H}^{\mathcal {T}mn}_{lhi} , \end{aligned}$$
(9)
$$\begin{aligned} \mathbb {M}_k(\mathcal {H}^{\mathcal {T}m}_{lhi})&= \frac{1}{|X_{m}^{\mathcal {T}}|} \sum _{x_{mn}^{\mathcal {T}}{\in }X_{m}^{\mathcal {T}}} ( \mathcal {H}^{\mathcal {T}mn}_{lhi} - \mathbb {E}(\mathcal {H}^{\mathcal {T}m}_{lhi}) )^k. \end{aligned}$$
(10)

Note that for online adaptation, we can only use the batch of data currently at hand. Let \(\mathbb {E}(\mathcal {H}_{lh})\) be the concatenation of \(\{\mathbb {E}(\mathcal {H}_{lhi}) \mid i \in T\}\) along the last dimension. Similarly, let \(\mathbb {M}_k(\mathcal {H}_{lh})\) be the concatenation of \(\{\mathbb {M}_k(\mathcal {H}_{lhi}) \mid i \in T\}\) along the last dimension. The loss function for test-time adaptation is defined by applying these moments to the CMD formula (Eq. 3):

$$\begin{aligned} \begin{aligned} \mathcal {L}_{attn}^{lh}&= \frac{1}{\log T} \Vert \mathbb {E}(\mathcal {H}_{lh}^{\mathcal {T}m}) - \mathbb {E}(\mathcal {H}_{lh}^{\mathcal {S}}) \Vert _2\\&\quad + \sum _{k=2}^K \frac{1}{(\log T)^k} \Vert \mathbb {M}_k(\mathcal {H}_{lh}^{\mathcal {T}m}) - \mathbb {M}_k(\mathcal {H}_{lh}^{\mathcal {S}}) \Vert _2 , \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \mathcal {L}_{attn} = \frac{1}{LH} \sum _{l=1}^{L} \sum _{h=1}^{H} \mathcal {L}_{attn}^{lh}. \end{aligned}$$
(12)

The maximum and minimum values of the attention entropy over T tokens are \(\log {T}\) and 0, respectively. Therefore, \(|b-a|=\log {T}\) in Eq. 11. Like Tent, we update only a few parameters in ViT by SGD to stabilize adaptation, as detailed in Sect. 4.5.
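The loss of Eqs. 11 and 12 can be sketched in plain Python as below, assuming that the kth-order term is scaled by \(1/(\log T)^k\) in line with the CMD definition of [38]. The data layout (an N x T list of entropies per head) and the function names are illustrative assumptions, and no gradient computation is shown.

```python
import math

def batch_moments(H, K=3):
    """Per-token mean and central moments (orders 2..K) of an N x T entropy table.

    Returns [mean_vec, M_2_vec, ..., M_K_vec], each of length T.
    """
    n = len(H)
    mean = [sum(col) / n for col in zip(*H)]
    stats = [mean]
    for k in range(2, K + 1):
        stats.append([sum((row[i] - mean[i]) ** k for row in H) / n
                      for i in range(len(mean))])
    return stats

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def attent_loss_head(target_H, source_stats, T, K=3):
    """Moment-matching loss for one (layer, head), following Eq. 11."""
    span = math.log(T)  # |b - a| = log T, the range of the attention entropy
    target_stats = batch_moments(target_H, K)
    loss = 0.0
    for order, (t, s) in enumerate(zip(target_stats, source_stats), start=1):
        loss += l2(t, s) / span ** order
    return loss

def attent_loss(per_head_losses):
    """Average over all layers and heads (Eq. 12)."""
    return sum(per_head_losses) / len(per_head_losses)

# Toy case with T = 2 tokens: the target batch has zero entropy everywhere,
# while the stored source mean for token 0 is the maximum entropy log(2).
source_stats = [[math.log(2), 0.0], [0.0, 0.0], [0.0, 0.0]]
head_loss = attent_loss_head([[0.0, 0.0]], source_stats, T=2, K=3)
```

In the toy case the mean term alone contributes, and the normalization by \(\log T\) maps the largest possible mean gap for a single token to 1.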

Finally, this objective function (Eq. 12) can be used alone or in combination with Tent (Eq. 4). We can optionally combine Attent and Tent loss functions as follows:

$$\begin{aligned} \mathcal {L}_{\text {mix}} = \mathcal {L}_{\text {tent}} + \lambda \mathcal {L}_{\text {attn}}, \end{aligned}$$
(13)

where \(\lambda \) is a balancing hyperparameter. Following Wang et al. [16], the parameter update follows the prediction for the current batch. Therefore, the parameter update only affects the next batch. The adaptation procedure is summarized in Algorithm 1.
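The ordering of prediction and update can be sketched as a simple loop. The callables `predict` and `grad_fn` are hypothetical placeholders for the model's forward pass and for the gradient of the adaptation loss; plain SGD without momentum or clipping is used here only to keep the sketch short, which differs from the setting described in Sect. 4.3.

```python
def mix_loss(l_tent, l_attn, lam=1.0):
    """Eq. 13: Tent loss plus lambda times the Attent loss."""
    return l_tent + lam * l_attn

def adapt_online(batches, params, predict, grad_fn, lr=0.001):
    """Predict each batch with the current parameters, then take one SGD step.

    predict(params, batch) and grad_fn(params, batch) are hypothetical
    callables supplied by the caller.
    """
    predictions = []
    for batch in batches:
        # The prediction only sees parameters updated on earlier batches.
        predictions.append(predict(params, batch))
        grads = grad_fn(params, batch)
        # This update affects the next batch, not the current one.
        params = [p - lr * g for p, g in zip(params, grads)]
    return predictions, params

# Toy run: one scalar parameter and a constant gradient of 1.0.
preds, final_params = adapt_online(
    batches=[[1.0], [1.0]],
    params=[1.0],
    predict=lambda p, b: [p[0] * x for x in b],
    grad_fn=lambda p, b: [1.0],
    lr=0.5,
)
```

The first batch is predicted with the unadapted parameter, and only the second batch benefits from the first update, which matches the protocol described above.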

Algorithm 1

4 Experiment

We show that our approach is effective against corruption using the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets. We run every experiment three times with different seeds, which change the shuffling order of the data. The mean and unbiased standard deviation of the metric are reported. Our implementation is in PyTorch [40], and every experiment is run on a cloud instance with one A100 GPU, except for ViT-L16, which uses an instance with two A100 GPUs.
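For reference, the reported summary over three runs can be computed as in the short sketch below. The error values are made up for illustration and are not results from this paper.

```python
import statistics

# Hypothetical top-1 errors (%) from three seeds; not results from the paper.
errors = [41.0, 42.0, 43.0]

mean_error = statistics.mean(errors)
# statistics.stdev divides by n - 1, i.e., it uses the unbiased variance estimate.
std_error = statistics.stdev(errors)
```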

4.1 Dataset and Preprocess

CIFAR-10/CIFAR-100 [41] and ImageNet [42] are used as source datasets. CIFAR-10/CIFAR-100 are, respectively, 10-class/100-class color image datasets, each including 50,000 training images and 10,000 test images at \(32{\times }32\) resolution. ImageNet is a 1000-class image dataset with more than 1.2 million training images and 50,000 validation images at various resolutions.

CIFAR-10-C/CIFAR-100-C and ImageNet-C are used as target datasets for test-time adaptation [20]. Each of these datasets contains 15 types of corruption at 5 levels of severity, and thus 75 varieties of corruption in total. The corrupted images are created from the original CIFAR-10 and CIFAR-100 test images and the ImageNet validation images. Therefore, CIFAR-10-C/CIFAR-100-C consist of 10,000 images for each corruption variety, and ImageNet-C consists of 50,000 images for each corruption variety.

For this experiment, the images of all the datasets are preprocessed so that they are uniformly resized to \(224{\times }224\). As for ImageNet, some images are rectangular, so all the images are resized to 256 and center-cropped to \(224{\times }224\). The ImageNet-C data have already been preprocessed in the same way and are publicly available. For ViT models, the pixels are first rescaled from [0, 255] to [0, 1]. Then, they are further rescaled to \([-1, 1]\) by normalization with mean and std specified as [0.5, 0.5, 0.5] and [0.5, 0.5, 0.5].
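The two-step pixel rescaling amounts to the following arithmetic for a single channel value. The function name is an illustrative assumption; resizing and cropping are not shown.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map a pixel value in [0, 255] to [-1, 1]: rescale to [0, 1], then normalize."""
    scaled = value / 255.0
    return (scaled - mean) / std

darkest = normalize_pixel(0.0)
middle = normalize_pixel(127.5)
brightest = normalize_pixel(255.0)
```

With mean and std both equal to 0.5, the endpoints 0 and 255 map to -1 and 1, and the midpoint maps to 0.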

4.2 Model and Training Setting

Vision Transformer (ViT) is used as the model for this experiment. For CIFAR-10 and CIFAR-100, we use ViT-B16 as the initial parameters for fine-tuning. ViT-B16 is already pretrained on ImageNet-21K [43], a large dataset with 21k classes and 14M images. For the fine-tuning hyperparameters, following [4], we use a batch size of 512 and SGD with momentum 0.9 and gradient clipping at global norm 1.0. We choose a learning rate of 0.03. For CIFAR-10, we apply a cosine schedule with 200 warmup steps and 1000 iterations in total. For CIFAR-100, we apply a cosine schedule with 500 warmup steps and 2000 iterations in total. After fine-tuning, the top-1 error is 1.1% on CIFAR-10 and 6.8% on CIFAR-100. For ImageNet, we use ViT-B16 and ViT-L16 parameters that are already pretrained on ImageNet-21K and fine-tuned on ImageNet (-2012). Therefore, fine-tuning on ImageNet is not needed. In our setting, the top-1 error on ImageNet is 18.6% for ViT-B16 and 17.1% for ViT-L16. ViT-B16 has 12 layers and 12 heads. ViT-L16 has 24 layers and 16 heads [4]. Both models have an input patch size of 16, so the number of tokens for ViT is 196 (= 14 \(\times \) 14) in this experiment.

4.3 Adaptation Setting

Before adaptation, we need to calculate the central moments of the attention entropies for the source data (Eqs. 6 and 7) and store them in memory. For this purpose, we use all the training data of the source dataset and turn dropout [44] off during the calculation.

As default hyperparameters for adaptation on target data, the batch size is set to 64, the optimizer is SGD with a constant learning rate of 0.001, momentum 0.9, and gradient clipping at 1.0, and the maximum order of central moments K is set to 3 across all the experiments. Dropout is turned off during both the forward and backward passes. We set \(\lambda =1.0\) in Eq. 13 to balance the Tent and Attent losses. The detailed hyperparameter sweep results are given in Sect. 4.7.

The top-1 classification error is used as the evaluation metric across all the experiments.

4.4 Baseline Methods

We compare Attent with existing baseline test-time adaptation methods that do not need to alter the training phase: Tent [16], PL [27], TFA(-) [29], T3A [31], SHOT-IM [28], and CFA [17]. In addition, we report the performance of the model on the target datasets without any adaptation as Source. T-BN [14, 15] is excluded from the baselines because ViT does not have a batch normalization layer. For a fair comparison, we use the same hyperparameter values across all the methods, as described in Sect. 4.3. The detailed setting for each method is described in Appendix 1.

Table 1 Modulation parameter study
Table 2 Method comparison
Table 3 Model and corruption severity study
Table 4 Detail experiment results of method comparison

4.5 Quantitative Result

Table 1 answers the question about which modulation parameters are the most suitable for improving the performance of Attent. The results indicate that layer-normalization parameters can stably reduce top-1 error for both Attent and Tent. Therefore, in all the subsequent experiments, layer-normalization parameters are updated for adaptation across all the methods for fair comparison.

Table 2 summarizes the adaptation results (top-1 error) on CIFAR-10-C, CIFAR-100-C, and ImageNet-C for our method and the existing test-time adaptation baselines. We measure the top-1 error on each dataset at the highest severity (= 5). The results verify that Attent stably reduces the error across all the datasets. Tent alone achieves a larger improvement than Attent. However, Tent is sometimes unstable on some of the corruptions at higher severity, while Attent is not (see Table 4 for details). For example, in Table 4, Tent on ViT-B16 fails to adapt to the “snow” corruption on ImageNet-C, causing a significant performance drop from 60.9 to 75.7% top-1 error. Interestingly, combining Tent and Attent improves the performance further while preventing this performance drop (see “Attent + Tent” in Table 2). We observe similar performance gains when combining Attent with other existing high-performance methods, such as SHOT-IM (whose objective is classifier entropy minimization plus a diversity regularizer) and TFA(-) and CFA (whose objectives minimize the distributional difference between source and target in the hidden representation just before the classifier). See “Attent + SHOT-IM”, “Attent + TFA(-)”, and “Attent + CFA” in Table 2.

To investigate the effectiveness of our approach, we conduct the following two analyses focusing on CMD. In the first analysis, instead of optimizing attention entropy in Attent by CMD, we simply minimize the attention entropy in the same way that Tent minimizes prediction entropy. This approach is denoted as “Attent - CMD” in Table 2. The result shows that this approach significantly deteriorates the performance, which indicates that distribution matching, rather than minimization, is necessary for controlling attention entropy. In the second analysis, we optimize the distribution of classification entropy in Tent using CMD to bring it back to that of the source dataset. This approach is denoted as “Tent + CMD” in Table 2. The result shows that the performance slightly deteriorates compared with the original Tent, which indicates that CMD is not appropriate for the classifier entropy.

Table 3 summarizes the adaptation results on ImageNet-C with severity from 1 to 5 for various backbone networks: ResNet50, ViT-B16, and ViT-L16. The ResNet50 result is included as a reference to show the performance difference between a regular ResNet model and ViT models. Following [16], Tent for ResNet50 updates the batch normalization parameters and reestimates the statistics. The results verify that both Tent and Attent consistently improve the performance across all the models for various degrees of corruption severity. This indicates that Tent and Attent are model agnostic. The experiment also demonstrates that Attent + Tent further boosts the performance regardless of network backbone and corruption severity.

Table 5 The result of measuring attention map reconstruction metric based on Eq. 16

4.6 Attention Map Reconstruction

We quantitatively analyze how close the attention map of a corrupted image gets to that of the corresponding clean image after adaptation, compared with before adaptation. Specifically, for each sample, we measure the cross entropy of attention maps between a corrupted image \(x^{\mathcal {T}}\) in ImageNet-C and the corresponding clean image \(x^{{\mathcal{S}} \leftarrow {\mathcal{T}}}\) in ImageNet by the following formulas:

$$\begin{aligned} \mathcal {H}^{mn}_{\text {Forward}}&= \frac{1}{LHT} \sum _{l,h,i,j} - A^{lh}_{ij}(x_{mn}^{{{\mathcal{S}} \leftarrow {\mathcal{T}}}};\bar{\theta }) \log A^{lh}_{ij}(x_{mn}^{\mathcal {T}};\theta _{m}), \end{aligned}$$
(14)
$$\begin{aligned} \mathcal {H}^{mn}_{\text {Reverse}}&= \frac{1}{LHT} \sum _{l,h,i,j} - A^{lh}_{ij}(x_{mn}^{\mathcal {T}};\theta _{m}) \log A^{lh}_{ij}(x_{mn}^{{{\mathcal{S}} \leftarrow {\mathcal{T}}}};\bar{\theta }), \end{aligned}$$
(15)

where \(\theta _{m}\) is the set of parameters at the time of the mth batch during test-time adaptation. Note that \(\theta _{m}=\bar{\theta }\) without adaptation (Source). Intuitively, Eq. 14 gives a higher penalty if the attention does not focus on the correct locations, while Eq. 15 gives a higher penalty if the attention focuses on wrong locations. We combine the two cross entropies and take the average across all the test data to obtain the evaluation metric:

$$\begin{aligned} \mathcal {H}&= \frac{1}{|X^{\mathcal {T}}|} \sum _{x_{mn}^{\mathcal {T}}{\in }X^{\mathcal {T}}} \left( \mathcal {H}^{mn}_{\text {Forward}} + \mathcal {H}^{mn}_{\text {Reverse}} \right) . \end{aligned}$$
(16)

A lower score of Eq. 16 indicates better attention map reconstruction.
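For a single sample, the sum of the forward and reverse cross entropies (Eqs. 14 and 15) can be sketched as follows; Eq. 16 then averages this quantity over the test set. The function names and the clamping constant `eps`, which guards against the logarithm of zero, are assumptions for illustration.

```python
import math

def cross_entropy(P, Q, eps=1e-12):
    """Sum over i, j of -P[i][j] * log Q[i][j] for two T x T attention maps."""
    return sum(-p * math.log(max(q, eps))
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row))

def reconstruction_score(clean_maps, corrupted_maps):
    """Forward plus reverse cross entropy for one sample.

    Each argument is a list with one T x T attention map per (layer, head):
    clean_maps from the clean image under the source parameters, and
    corrupted_maps from the corrupted image under the current parameters.
    """
    n_maps = len(clean_maps)   # L * H
    T = len(clean_maps[0])
    norm = n_maps * T
    pairs = list(zip(clean_maps, corrupted_maps))
    forward = sum(cross_entropy(c, t) for c, t in pairs) / norm
    reverse = sum(cross_entropy(t, c) for c, t in pairs) / norm
    return forward + reverse

# Toy maps with T = 2 tokens and a single (layer, head).
peaked = [[0.9, 0.1], [0.1, 0.9]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
matched = reconstruction_score([peaked], [peaked])
mismatched = reconstruction_score([peaked], [uniform])
```

In this toy case the mismatched pair scores higher than the matched pair, in line with the reading that a lower score means the adapted attention map is closer to the clean one.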

Table 5 summarizes the metric measured on images from ImageNet and ImageNet-C for the 15 corruption types at the highest severity. The result demonstrates that Attent has the strongest tendency to reconstruct the attention map toward the original one. This tendency may contribute to the improvement in image classification accuracy. Tent also tends to reconstruct the attention map, but not as much as Attent, which implies that the performance improvement by Tent is related to other latent variables as well as the attention map. The score of Attent + Tent lies between those of Tent and Attent, which suggests that the attention map and other latent variables are optimized at the same time. In the case of the snow corruption, Tent fails to adapt and its performance deteriorates substantially (see Table 4); at the same time, its attention map reconstruction score also deteriorates significantly (see Table 5), indicating the importance of the attention map.

4.7 Hyperparameter Sensitivity

Fig. 2

The effect of sweeping hyperparameters on each method. The evaluation metric is top-1 error on ImageNet-C averaged over 15 corruption types at the highest severity. ViT-B16 is used as the model. One hyperparameter value at a time is changed from the default described in Sect. 4.3

For online adaptation, hyperparameter tuning is a challenging issue. Figure 2 shows the sensitivity to each hyperparameter on ImageNet-C at the highest severity, averaged over 15 corruption types. We examine the following hyperparameters by changing one value at a time from the default described in Sect. 4.3: (a) learning rate, (b) batch size, (c) maximum order of central moments K in Eq. 11, and (d) whether to enable gradient clipping for SGD optimization. \(K=1\) denotes first-order moment (mean) matching only, which ignores higher order moment matching, i.e., Eqs. 7 and 10.

The important finding is that Tent is more sensitive than Attent to some of the hyperparameters described above. In particular, enabling gradient clipping is essential for applying Tent to ViT models to avoid catastrophic failure of adaptation. Furthermore, a large learning rate also leads to catastrophic failure of Tent. In contrast, Attent is quite insensitive to each hyperparameter. This indicates that we can use Attent safely in an unknown environment with only rough hyperparameter tuning. It is also shown that increasing the maximum order of central moments K improves the performance of Attent, but the gain decreases as K gets larger. This is consistent with the original CMD study [38], which states that the performance is similar when \(K \ge 3\).

5 Conclusion and Future Work

This study proposed a novel method of test-time adaptation for ViT, called Attent, which adapts ViT by minimizing the distributional differences of the attention entropy between the source and target at test time. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C show that Attent is effective on various ViT models. Combining Attent with other TTA methods further improves the robustness. As a limitation, Attent is not effective on some domain adaptation benchmarks, such as digit style shift, e.g., from SVHN to MNIST/MNIST-M. Future work includes improving our method so that it adapts well on these tasks. We hope that this study will encourage further research on test-time adaptation for ViT.