Robustifying Vision Transformer Without Retraining from Scratch Using Attention-Based Test-Time Adaptation

Vision Transformer (ViT) is becoming increasingly popular in image processing. This study aims to improve robustness against unknown perturbations without retraining the ViT model from scratch. Since our approach does not alter the training phase, it does not need to repeat the computationally heavy pretraining of ViT. Specifically, we use test-time adaptation (TTA), in which the model corrects its own predictions during test-time. The representative test-time adaptation method, Tent, was recently found to be applicable to ViT by choosing appropriate modulation parameters and applying gradient clipping. However, we observed that Tent sometimes fails catastrophically, especially under severe perturbations. To stabilize the adaptation, we propose a new loss function called Attent, which minimizes the distributional differences of the attention entropy between the source and target. Experiments on image classification with CIFAR-10-C, CIFAR-100-C, and ImageNet-C show that both Tent and Attent are effective on a wide variety of corruptions. The results also show that combining Attent and Tent further improves the classification accuracy on corrupted data.

Notably, Dosovitskiy et al. [4] propose Vision Transformer (ViT), which adapts a pure transformer to image classification, and show that it achieves comparable or superior performance to conventional convolutional neural networks (CNNs). Follow-up studies also show that ViT is more robust against common corruptions and perturbations than convolution-based models (e.g., ResNet) [9,10], which is an important property for safety-critical applications.
This paper seeks to answer the following question: can we further improve the robustness of ViT without retraining from scratch? Using heavy data augmentation during training is a natural way to improve robustness, and prior works demonstrate that several data augmentation methods can indeed improve the robustness of CNNs [11,12]. Another study empirically shows that a sharpness-aware optimizer improves the robustness of ViT [13]. However, retraining ViT from scratch is not desirable, since it imposes a huge computational burden. Moreover, the dataset used for pretraining is sometimes not publicly available, making it impossible to retrain the model.
Test-time adaptation is a recently proposed approach for improving the robustness of a model [14-16]. In test-time adaptation, a model corrects its predictions on test data without looking at the labels during test-time by modulating a small portion of the model's parameters [typically the parameters of Batch Normalization (BN) and/or its statistics]. For example, Wang et al. [16] propose Tent, which modulates the parameters of BN by minimizing prediction entropy, and show that it can significantly improve the robustness of ResNet. This approach is favorable for our case, as it does not alter the training phase and thus does not need to repeat the computationally heavy training of ViT. Recently, Kojima et al. [17] have shown that the existing representative TTA methods, including Tent, are also effective on ViT when appropriate parameters of ViT are modulated. Specifically, the study demonstrated that, for each TTA method, updating only the affine transformation parameters within the layer-normalization layers of ViT boosts the performance on target datasets, including common corruptions and domain shift. Besides the parameter choice, Kojima et al. [17] found that gradient clipping [18], which is not used in the original Tent paper [16], is essential for applying Tent to ViT to mitigate catastrophic failure. However, we observed that just minimizing the prediction entropy of ViT, as Tent does, often causes catastrophic failure, especially under severe distribution shifts.
In this paper, we design a new loss function, called Attent, to stabilize the test-time adaptation of ViT. Attent adapts to target data by minimizing the distributional difference of attention entropy between source data and target data. Optimization is performed for each layer, head, and token on the target data so that their distributions return to those of the source data. The distributional statistics of attention entropy on the source dataset are calculated and stored in memory beforehand. Therefore, we can use this approach without the source dataset during adaptation.
In summary, our main contributions are as follows.
• This paper proposes a new test-time adaptation method for ViT. To mitigate the catastrophic failure of an existing TTA method (Tent) on ViT, we introduce a new loss function called Attent. Attent minimizes the distributional differences of the attention entropy between the source and target in an online manner.
• Using multiple standard datasets to benchmark the robustness against common corruptions and perturbations, namely CIFAR-10-C, CIFAR-100-C, and ImageNet-C, we validate that the robustness of ViT (ViT-B16 and ViT-L16) is improved by Attent without retraining the model from scratch.
• We show that the robustness is further improved by combining Attent with other TTA methods that do not require source information, e.g., Tent or SHOT-IM. In particular, the improvement is significant for more severe corruptions from which Tent alone cannot recover. In addition, Attent is less sensitive to hyperparameters, which is a favorable property in practical settings.

Transformer Architecture
Models based on the Transformer architecture [19] achieve strong performance not only in NLP but also in image processing, as with Vision Transformer [4]. Self-attention is one of the building blocks of Transformer. Let $z^l \in \mathbb{R}^{T \times D}$ be the input to the $l$th self-attention layer, where $T$ is the number of tokens and $D$ is the number of features in each token. For each layer $l$ and head $h$, the attention block takes $z^l$ as input and computes the attention weight matrix $A^{lh} \in \mathbb{R}^{T \times T}$. Specifically, the attention weight between positions $i$ and $j$, $A^{lh}_{ij}$, is calculated from the inner product between the respective query $Q^{lh}_i$ and key $K^{lh}_j$ representations as follows:

$$Q^{lh} = z^l W_Q^{lh}, \qquad K^{lh} = z^l W_K^{lh}, \tag{1}$$

$$A^{lh}_{ij} = \frac{\exp\left(Q^{lh}_i \cdot K^{lh}_j / \sqrt{D_h}\right)}{\sum_{j'=1}^{T} \exp\left(Q^{lh}_i \cdot K^{lh}_{j'} / \sqrt{D_h}\right)}, \tag{2}$$

where $W_Q^{lh}, W_K^{lh} \in \mathbb{R}^{D \times D_h}$ are learnable parameters. $D_h$ is typically set to $D/H$ to keep the number of parameters constant even when the number of heads $H$ is changed.
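As a small illustration of how one head turns query and key vectors into attention weights, the following sketch applies a row-wise softmax to scaled inner products. It assumes the standard scaled dot-product formulation of [19]; the function name `attention_weights` and the plain-list data layout are illustrative choices, not part of the original implementation.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled inner products between queries and keys.

    queries, keys: lists of T vectors, each of length D_h (assumed layout).
    Returns a T x T list of lists; every row sums to 1.
    """
    d_h = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_h) for k in keys]
        max_score = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two tokens with orthogonal query/key vectors: each token attends mostly to itself.
A = attention_weights([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Each row of `A` is a probability distribution over tokens, which is what makes an entropy of attention well defined.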

Inherent Robustness of ViT
Recent studies verify experimentally that ViT is already robust without any adaptation or additional data augmentation. Several studies empirically show, using several benchmark datasets, that ViT is inherently more robust than CNNs [9,10]. The datasets include ImageNet-C (corruption), ImageNet-P (perturbations), ImageNet-R (semantic shifts), ImageNet-O (out-of-domain distribution), and ImageNet-9 (background dependence) [12, 20-22]. Other studies show that the robustness of ViT is further improved by changing the training strategy, such as using a larger dataset for the pretraining phase [9,23] or a sharpness-aware optimizer for the training phase [13].

Test-Time Adaptation
Test-time adaptation (TTA) is an online adaptation approach. Our proposal belongs to this category. Test-time adaptation does not require altering the training process, so it is generally applicable to a wide variety of training methods. This approach adapts the model to the target data at the level of real-time mini-batches to boost accuracy in an online manner. Test-time batch normalization [14,15] reestimates the statistics of batch normalization [24] on the target dataset. Test-time entropy minimization (Tent) [16] adapts to unlabeled online target data by minimizing the Shannon entropy [25] of its predictions while updating only the affine transformation parameters of batch normalization. Tent was shown to be effective when the model is a CNN with batch normalization, e.g., ResNet50 [26], while a recent study [17] has demonstrated that Tent is also applicable to ViT (see Sect. 3.1 for the details). One can also use different loss functions for updating parameters, such as pseudo-label (PL) [27], diversity regularization (SHOT-IM) [28], feature alignment (TFA [29] and CFA [17]), or contrastive learning [30]. A recent study by Iwasawa et al. [31] has proposed a gradient-free procedure that updates only the classifier parameters of the model (T3A). Our approach is categorized as a feature alignment approach, which minimizes the statistical distance between the source and target datasets. It is worth noting that feature alignment approaches for test-time adaptation assume that one can access the statistics of the source dataset during the test phase, but they need neither access to the source dataset itself nor a repetition of the computationally heavy training [17]. Test-time adaptation generally assumes that the model would be distributed without source data due to bandwidth, privacy, or profit reasons [16].
We argue that the statistics of source data would be distributed even in such a situation, since it could drastically compress data size and eliminate sensitive information. In fact, some layers often used in typical neural networks contain statistics of source data (e.g., batch normalization) [17].

Data Augmentation
Several studies show that adding data augmentation increases the robustness of a model. For example, Hendrycks et al. [11] show that randomly selected noise operations and their compositions can improve the robustness against corruptions (AugMix). Hendrycks et al. [12] propose generating a variety of noises by passing training images through an image-to-image network and introducing several perturbations during the forward pass; the output can be used as a noisy training dataset (DeepAug). However, changing the data augmentation usually requires retraining the model from scratch, which is computationally heavy for ViT.

Unsupervised Domain Adaptation
Several recent studies use unsupervised domain adaptation (UDA) [32-34] to improve the robustness of CNNs. For example, Xie et al. [35] propose Noisy-Student, which trains the model with pseudo-labeled data [27] that come from the target distribution. Similarly, Rusak et al. [36] propose using the pseudo-label technique with a robust classification loss, called Generalized Cross Entropy (GCE) [37], as the objective function. These works demonstrate the usefulness of UDA for robustifying the model; however, this approach needs to assume the type of corruption in advance, so that the unlabeled data from the target domain must be available at training time. Unlike UDA, our approach does not need to assume the corruption type in advance, which is a more practical setting.

Central Moment Discrepancy
To stabilize the adaptation, our method uses central moment discrepancy (CMD) [38] to measure the discrepancy between source and target. Let $X$ and $Y$ be bounded random samples with respective probability distributions $p$ and $q$ on the interval $[a, b]^N$. Formally, CMD is defined by

$$\mathrm{CMD}_K(X, Y) = \frac{1}{|b-a|}\left\|\mathbb{E}(X) - \mathbb{E}(Y)\right\|_2 + \sum_{k=2}^{K} \frac{1}{|b-a|^k}\left\|c_k(X) - c_k(Y)\right\|_2, \tag{3}$$

where $\mathbb{E}(X) = \frac{1}{|X|}\sum_{x \in X} x$ is the empirical expectation vector computed on the sample $X$. Here, $|X|$ denotes the total number of samples, following the notation in [38]. $c_k(X) = \mathbb{E}\left((x - \mathbb{E}(X))^k\right)$ is the vector that consists of the $k$th-order central moments for each element of $X$. The same definitions apply to $Y$. Previous works focus on using CMD in the field of UDA to reduce the distributional gaps between source and target representations. However, CMD can also be used for test-time adaptation, because it does not need to store the source dataset itself; instead, we store the central moment statistics of the source data in memory and use them during online adaptation for moment matching between source and target.
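The moment-matching idea can be sketched in a few lines. The code below is a minimal illustration that assumes the CMD form of [38] (a distance between means plus distances between central moments of orders 2 to K, each scaled by a power of the interval width). The helper names `central_moments`, `l2_distance`, and `cmd` are hypothetical and do not come from the authors' implementation.

```python
import math

def central_moments(samples, max_order):
    """Return [mean, c_2, ..., c_K], each computed element-wise over a list of vectors."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[d] for x in samples) / n for d in range(dim)]
    moments = [mean]
    for k in range(2, max_order + 1):
        moments.append([sum((x[d] - mean[d]) ** k for x in samples) / n for d in range(dim)])
    return moments

def l2_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cmd(source_moments, target_moments, a, b):
    """CMD between two moment lists for samples bounded in [a, b]."""
    span = abs(b - a)
    return sum(
        l2_distance(s, t) / span ** k
        for k, (s, t) in enumerate(zip(source_moments, target_moments), start=1)
    )

# Only the moment lists are needed at adaptation time, not the source samples themselves.
source = central_moments([[0.0], [1.0]], max_order=3)
target = central_moments([[0.5], [0.5]], max_order=3)
gap = cmd(source, target, a=0.0, b=1.0)
```

In this toy case the two samples share the same mean, so the whole discrepancy comes from the second-order term, which shows why matching the mean alone can miss a shift.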

Tent for ViT
Recently, Kojima et al. [17] studied applying existing TTA methods, including Tent, to ViT. Assume that we have a pretrained model whose parameters are denoted by $\theta$. The model is trained on clean source data and needs to predict data with unknown corruption. Given the input image $x$, the model outputs the conditional probability $P(y \mid x; \theta)$. During test-time, Tent [16] minimizes the following prediction entropy using stochastic gradient descent (SGD):

$$\mathcal{L}_{tent}(x) = -\sum_{c=1}^{C} P_c \log P_c, \tag{4}$$

where $C$ is the number of classes, and $P_c = P(y = c \mid x; \theta)$ is the estimated probability that $x$ belongs to class $c$.
While updating the entire $\theta$ is technically possible, it is known to be ineffective in test-time adaptation. A key feature of Tent is that it does not alter the entire set of parameters $\theta$ but only a small portion of them, $\psi \subset \theta$. Specifically, Tent [16] updates a set of parameters related to the affine transformation of batch normalization. However, recent large models including ViT do not have batch normalization. Therefore, there is no trivial answer to the question: which parameters should we update when applying Tent to ViT models? According to Wang et al. [16], updating only feature modulations that are linear and low-dimensional leads to stable and efficient adaptation. Kojima et al. [17] found that, for each TTA method, updating only the affine transformation parameters within the layer-normalization layers of ViT stably boosts the performance of ViT on target datasets. The layer normalization in ViT reestimates the mean and standard deviation of the input across the dimensions of the input itself, followed by an affine transformation for each dimension. The notable difference between the modulation of layer normalization and that of batch normalization is that layer normalization does not need to calculate the mean and standard deviation across multiple samples, i.e., layer normalization automatically reestimates the mean and standard deviation across dimensions on every single target sample. Therefore, we only need to update the affine transformation parameters in layer normalization for adaptation. In addition, Kojima et al. [17] demonstrated that updating all the parameters gives the best performance improvement when an appropriate loss function, such as SHOT-IM or CFA, is used. We experimentally find the best set of parameters for our case (Attent) in Sect. 4.5.
Besides the parameter choice, Kojima et al. [17] found that gradient clipping [18], which is not used in the original paper [16], is essential for applying Tent to ViT. Specifically, Kojima et al. [17] clip the gradients whose norm is greater than 1.0 in all experiments. Whether these techniques are unique to ViT (or huge models) needs further investigation, but without them, Tent often failed catastrophically (a significant drop in classification accuracy). See Sect. 4.7 for details.
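The two ingredients discussed in this section, the prediction entropy that Tent minimizes and gradient clipping at norm 1.0, can be illustrated as follows. This is a sketch on plain Python lists rather than the PyTorch tensors used in the experiments; `prediction_entropy` and `clip_by_global_norm` are illustrative names, and the clipping shown is the common rescale-to-maximum-norm rule, which is assumed here to match the clipping of [18].

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

confident = prediction_entropy([1.0, 0.0, 0.0])   # a one-hot prediction has zero entropy
uncertain = prediction_entropy([0.25] * 4)        # a uniform prediction has entropy log(4)
clipped = clip_by_global_norm([3.0, 4.0])         # norm 5 is rescaled to norm 1
```

The one-hot case also shows the degenerate optimum of pure entropy minimization: a model that always outputs a single class reaches zero loss regardless of the input.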

Attent
Tent is sometimes unstable during adaptation even when we implement the techniques described in Sect. 3.1 (see Table 4 for details). The reason is that its objective function merely minimizes the entropy of the classification output. One extreme solution to this objective is to always assign 100% probability to a single class, which amounts to catastrophic failure of adaptation.
To alleviate this problem, we propose a new approach for test-time adaptation called Attent. In this approach, we focus on the attention mechanism in ViT, which is one of the architectural components critical for correct prediction [9,39]. We hypothesize that the inner product between $Q$ and $K$ for the attention (Eq. 2) is shifted if ViT takes as input target data whose distribution is different from that of the source data. Consequently, the attention weight distribution is shifted in an unexpected way. Our idea is to make the anomalous distribution return to normal by distribution matching of attention entropy between source and target data (see Fig. 1 for the method overview). Simply minimizing attention entropy would fail, since it may lead to paying attention to extremely narrow areas (see Sect. 4.5 for the experimental result). Similar to most prior works, our method uses stochastic gradient descent to adapt the model during test-time. Unlike prior methods, such as Tent [16], PL [27], and T3A [31], that modulate the parameters using only the data available at test-time, our method aligns the statistics of features between source and target. In other words, we leverage the source statistics as auxiliary information regarding the source distribution to avoid catastrophic failure.

Fig. 1 Method overview. During test-time, the label is predicted, while partial parameters in ViT are updated by distribution matching of attention entropy between the source and target datasets

Let $L$, $H$, and $T$ be the number of layers, the number of heads, and the number of tokens in the transformer. Given a data sample from the source dataset, $x^S_n \in X^S$, the attention entropy can be calculated for each layer $l \in L$, head $h \in H$, and token $i \in T$. Specifically, following the notation of Sect. 2.1.1, we define $A^{lh}(x^S_n; \bar{\theta}) \in \mathbb{R}^{T \times T}$ as the attention weight matrix parameterized by parameters $\bar{\theta}$, where $\bar{\theta}$ are the parameters just after training on the source dataset, i.e., before adaptation.
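The attention entropy on a source data sample is defined as follows:

$$H^{lhi}(x^S_n) = -\sum_{j=1}^{T} A^{lh}_{ij}(x^S_n; \bar{\theta}) \log A^{lh}_{ij}(x^S_n; \bar{\theta}). \tag{5}$$

The mean and $k$th-order central moments of the entropy for the source data are calculated as follows and stored in memory as fixed values:

$$\mathbb{E}(H^{lhi}_S) = \frac{1}{|X^S|} \sum_{x^S_n \in X^S} H^{lhi}(x^S_n), \tag{6}$$

$$c_k(H^{lhi}_S) = \frac{1}{|X^S|} \sum_{x^S_n \in X^S} \left(H^{lhi}(x^S_n) - \mathbb{E}(H^{lhi}_S)\right)^k, \tag{7}$$

where $k = 2, \ldots, K$. Attent uses these statistics to adapt the model during the test phase.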
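To make the stored source statistics concrete, the sketch below computes the entropy of one attention row (one layer, head, and token) and then the mean and central moments of that entropy over a set of samples. The function names and the dictionary layout are assumptions made for illustration; in practice these quantities would be computed on tensors for all layers, heads, and tokens at once.

```python
import math

def attention_entropy(attn_row):
    """Entropy of the attention distribution of one token over all T tokens."""
    return -sum(a * math.log(a) for a in attn_row if a > 0.0)

def entropy_statistics(entropies, max_order=3):
    """Mean (key 1) and central moments of order 2..max_order of scalar entropies."""
    n = len(entropies)
    mean = sum(entropies) / n
    stats = {1: mean}
    for k in range(2, max_order + 1):
        stats[k] = sum((h - mean) ** k for h in entropies) / n
    return stats

# Entropies of the same (layer, head, token) position for three hypothetical samples.
rows = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
stats = entropy_statistics([attention_entropy(r) for r in rows])
```

Only `stats` has to be kept after this pass over the source data, which is why the source dataset itself is not needed during adaptation.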
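During the test-time phase, assume that a sequence of test data drawn from the target distribution arrives at our model one after another. The set of test samples in the $m$th batch is denoted as $X^T_m \subset X^T$, $m = 1, \ldots, M$. For each sample in each batch, $x^T_{mn} \in X^T_m$, the attention entropy on target data is calculated as in the first phase, using the current parameters $\theta$:

$$H^{lhi}(x^T_{mn}) = -\sum_{j=1}^{T} A^{lh}_{ij}(x^T_{mn}; \theta) \log A^{lh}_{ij}(x^T_{mn}; \theta). \tag{8}$$

The mean and higher order central moments of the entropy are calculated for each target batch:

$$\mathbb{E}(H^{lhi}_{T_m}) = \frac{1}{|X^T_m|} \sum_{x^T_{mn} \in X^T_m} H^{lhi}(x^T_{mn}), \tag{9}$$

$$c_k(H^{lhi}_{T_m}) = \frac{1}{|X^T_m|} \sum_{x^T_{mn} \in X^T_m} \left(H^{lhi}(x^T_{mn}) - \mathbb{E}(H^{lhi}_{T_m})\right)^k. \tag{10}$$

Note that for online adaptation, we can only use the real-time batch data at hand. Let $\mathbb{E}(H^{lh})$ be the concatenation of $\{\mathbb{E}(H^{lhi}) \mid i \in T\}$ along the last dimension. Similarly, let $c_k(H^{lh})$ be the concatenation of $\{c_k(H^{lhi}) \mid i \in T\}$ along the last dimension. A loss function for test-time adaptation is defined by applying the aforementioned central moments to the CMD formula (Eq. 3). For layer $l$ and head $h$, this gives

$$\mathcal{L}^{lh}_{attn} = \frac{1}{|b-a|}\left\|\mathbb{E}(H^{lh}_S) - \mathbb{E}(H^{lh}_{T_m})\right\|_2 + \sum_{k=2}^{K} \frac{1}{|b-a|^k}\left\|c_k(H^{lh}_S) - c_k(H^{lh}_{T_m})\right\|_2. \tag{11}$$

The maximum and minimum values of the attention entropy with $T$ tokens are $\log T$ and 0, respectively. Therefore, $|b - a| = \log T$ in Eq. 11. Like Tent, we update only a few parameters in ViT by SGD to stabilize adaptation, which is detailed in Sect. 4.5.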
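The batch-level matching step can be sketched for a single layer and head as follows. The code takes source statistics that were stored beforehand and the entropies of the current target batch, and evaluates a CMD-style loss with the interval width set to log T, as stated in the text. The function name `attent_loss_one_head` and the list-based layout are hypothetical, and how the per-head terms are aggregated over layers and heads is not shown here.

```python
import math

def attent_loss_one_head(source_stats, target_entropies, num_tokens):
    """CMD-style loss between stored source moments and one target batch.

    source_stats: [mean_vec, c2_vec, ..., cK_vec], each of length num_tokens.
    target_entropies: one entropy vector of length num_tokens per batch sample.
    """
    span = math.log(num_tokens)  # attention entropy lies in [0, log T]
    n = len(target_entropies)
    mean = [sum(h[i] for h in target_entropies) / n for i in range(num_tokens)]
    loss = 0.0
    for k, src in enumerate(source_stats, start=1):
        if k == 1:
            tgt = mean
        else:
            tgt = [sum((h[i] - mean[i]) ** k for h in target_entropies) / n
                   for i in range(num_tokens)]
        loss += math.sqrt(sum((s - t) ** 2 for s, t in zip(src, tgt))) / span ** k
    return loss

# Toy case with T = 2 tokens and moments up to order K = 2.
stored = [[0.5, 0.5], [0.0, 0.0]]
matched = attent_loss_one_head(stored, [[0.5, 0.5], [0.5, 0.5]], num_tokens=2)
shifted = attent_loss_one_head(stored, [[0.0, 0.0]], num_tokens=2)
```

Here `matched` is zero because the batch statistics coincide with the stored ones, while `shifted` is positive because the target attention has collapsed to zero entropy.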
Finally, this objective function (Eq. 12) can be used alone or in combination with Tent (Eq. 4). We can optionally combine the Attent and Tent loss functions as follows:

$$\mathcal{L}_{mix} = \mathcal{L}_{tent} + \lambda \mathcal{L}_{attn}, \tag{13}$$

where $\lambda$ is a balancing hyperparameter. Following Wang et al. [16], the parameter update follows the prediction for the current batch. Therefore, the parameter update only affects the next batch. The adaptation procedure is summarized in Algorithm 1.

Algorithm 1 Online Adaptation using Attent
Input: Fine-tuned DNN model with parameters θ; partial parameters to be updated during adaptation ψ ⊂ θ; target test dataset $X^T$; $m$th ordered batch data $X^T_m \subset X^T$; statistics of Eqs. 6 and 7 calculated from the source training dataset.
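The online procedure described in the text (predict the current batch first, then update the partial parameters so that only the next batch is affected) can be sketched as a generic loop. This is not a reproduction of Algorithm 1: `predict` and `loss_grad` are hypothetical callables supplied by the user, the parameters are a flat list of floats, and the update is SGD with momentum and gradient-norm clipping as described in the adaptation settings.

```python
import math

def adapt_online(batches, predict, loss_grad, psi, lr=0.001, momentum=0.9, max_norm=1.0):
    """Predict each batch with the current psi, then take one clipped SGD step."""
    velocity = [0.0] * len(psi)
    predictions = []
    for batch in batches:
        predictions.append(predict(batch, psi))   # prediction uses the current parameters
        grads = loss_grad(batch, psi)             # gradient of the adaptation loss w.r.t. psi
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > max_norm:                       # gradient clipping by global norm
            grads = [g * max_norm / norm for g in grads]
        velocity = [momentum * v + g for v, g in zip(velocity, grads)]
        psi = [p - lr * v for p, v in zip(psi, velocity)]  # affects the next batch only
    return predictions, psi

# Toy usage: a single parameter pulled toward the batch mean by a squared-error loss.
toy_predict = lambda batch, psi: psi[0]
toy_grad = lambda batch, psi: [2.0 * (psi[0] - sum(batch) / len(batch))]
preds, final_psi = adapt_online([[1.0], [1.0]], toy_predict, toy_grad, [0.0],
                                lr=0.1, momentum=0.0)
```

Because the prediction is recorded before the update, the first batch is always predicted with the unadapted parameters, which matches the statement that the update only affects the next batch.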

Experiment
We show that our approach is effective against corruption using the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets. We run all the experiments three times with different seeds for different data order shuffling. The mean and unbiased standard deviation of the metric are reported. Our implementation is in PyTorch [40], and every experiment is run on a cloud instance with one A100 GPU, except for ViT-L16, which is run on an instance with two A100 GPUs.
CIFAR-10-C/CIFAR-100-C and ImageNet-C are used as target datasets for test-time adaptation [20]. Each of these datasets contains data with 15 types of corruption at 5 levels of severity. Therefore, each dataset has 75 varieties of corruption in total. Each corrupted image is created from an original CIFAR-10 or CIFAR-100 test image, or from an ImageNet validation image. Therefore, CIFAR-10-C/CIFAR-100-C consists of 10,000 images for each corruption variety, and ImageNet-C consists of 50,000 images for each corruption variety.
For this experiment, images of all the datasets are preprocessed so that they are resized uniformly to 224×224. As for ImageNet, some images are rectangular, so all the images are resized to 256 and center-cropped to 224×224. The ImageNet-C data have already been preprocessed in the same way and are publicly available. For ViT models, the pixels are first rescaled from
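As a small illustration of the reporting convention (mean and unbiased standard deviation over three seeds), the sketch below uses Python's `statistics` module, whose `stdev` function computes the sample standard deviation with the n - 1 denominator. The three error values are placeholders chosen for the example and are not results from this study.

```python
import statistics

def summarize_runs(errors):
    """Return the mean and the unbiased (n - 1) standard deviation of per-seed errors."""
    return statistics.mean(errors), statistics.stdev(errors)

# Placeholder top-1 errors for three seeds (illustrative values only).
mean_err, std_err = summarize_runs([10.0, 11.0, 12.0])
```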

Model and Training Setting
Vision Transformer (ViT) is used as the model for this experiment. For CIFAR-10 and CIFAR-100, we use ViT-B16 as the initial parameters for fine-tuning. ViT-B16 is already pretrained on ImageNet-21K [43], which is a large dataset with 21k classes and 14M images. For the fine-tuning hyperparameters, following [4], we use a batch size of 512 and SGD with momentum 0.9 and gradient clipping at global norm 1.0 as the optimizer. We choose a learning rate of 0.03. For CIFAR-10, we apply a cosine schedule with 200 warmup steps and 1000 iterations in total. For CIFAR-100, we apply a cosine schedule with 500 warmup steps and 2000 iterations in total. The fine-tuning result is a top-1 error of 1.1% for CIFAR-10 and 6.8% for CIFAR-100. For ImageNet, we use the parameters of ViT-B16 and ViT-L16 that are already pretrained on ImageNet-21K and also fine-tuned on ImageNet (-2012). Therefore, fine-tuning on ImageNet is not needed. The top-1 error on ImageNet is 18.6% for ViT-B16 and 17.1% for ViT-L16 in our setting. ViT-B16 has 12 layers and 12 heads. ViT-L16 has 24 layers and 16 heads [4]. Both models have an input patch size of 16, so that the number of tokens for ViT is defined as 196 (= 14 × 14) in this experiment.
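The token count quoted above follows directly from the image and patch sizes, and it also fixes the upper bound of the attention entropy. The short sketch below works through that arithmetic; `num_patch_tokens` is an illustrative helper name, and the count covers patch tokens only, following the definition used in this experiment.

```python
import math

def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

tokens = num_patch_tokens(224, 16)   # 14 * 14 = 196 tokens
max_entropy = math.log(tokens)       # entropy of a uniform distribution over 196 tokens
```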

Adaptation Setting
Before adaptation, we need to calculate the central moments of attention entropies for source data (Eqs. 6 and 7) to store them in memory. For this purpose, we use all the training data in source and set the dropout [44] off during the calculation.
As default hyperparameters for adaptation on target data, the batch size is set to 64, the optimizer is SGD with a constant learning rate of 0.001, momentum 0.9, and gradient clipping at 1.0, and the maximum order of central moments $K$ is set to 3 across all the experiments. Dropout is turned off during both the forward and backward passes. We set $\lambda = 1.0$ in Eq. 13 to balance the Tent and Attent losses. The detailed hyperparameter sweep results are found in Sect. 4.7.
Top-1 error of classification is used as evaluation metric across all the experiments.

Baseline Methods
We compare Attent with existing baseline test-time adaptation methods that do not need to alter the training phase: Tent [16], PL [27], TFA(-) [29], T3A [31], SHOT-IM [28], and CFA [17]. In addition, we report the performance of the model on target datasets without any adaptation as Source. T-BN [14,15] is excluded from the baselines, because ViT does not have a batch normalization layer. For a fair comparison, we use the same hyperparameter values across all the methods, as described in Sect. 4.3. The detailed setting for each method is described in Appendix 1.

Table 1 Modulation parameter study. The evaluation metric is top-1 error on ImageNet-C for Gaussian noise with the highest severity. ViT-B16 is used as the model. As a reference, the performance without any adaptation ("Source") was 61.9 ± 0.0. CLS: CLS token (the CLS token is a parameterized vector and has been shown to be efficient for fine-tuning large models on downstream tasks in NLP [45]); LN: LayerNorm; ALL: all the parameters of ViT

Table 1 answers the question of which modulation parameters are the most suitable for improving the performance of Attent. The results indicate that updating the layer-normalization parameters stably reduces the top-1 error for both Attent and Tent. Therefore, in all the subsequent experiments, the layer-normalization parameters are updated for adaptation across all the methods for a fair comparison.

Table 2 summarizes the adaptation results (top-1 error) on CIFAR-10-C, CIFAR-100-C, and ImageNet-C for our method and the existing test-time adaptation baselines. We measure the top-1 error on each dataset at the highest severity (= 5). The results verify that Attent stably reduces the error across all the datasets. Tent alone yields a larger improvement than Attent. However, Tent is sometimes unstable on some of the corruptions at higher severity, while Attent is not. See Table 4 for details. For example, in Table 4, Tent on ViT-B16 fails to adapt to the "snow" corruption on ImageNet-C, causing a significant performance drop from 60.9 to 75.7% top-1 error. Interestingly, combining Tent and Attent improves the performance further while preventing the aforementioned performance drop (see "Attent + Tent" in Table 2).
We observe similar performance gains when combining Attent with other existing high-performance methods, such as SHOT-IM (whose objective is classifier entropy minimization plus a diversity regularizer) and TFA(-) and CFA (whose objectives are minimizing distribution differences of the hidden representation between source and target). See "Attent + SHOT-IM", "Attent + TFA(-)", and "Attent + CFA" in Table 2.

Quantitative Result
To investigate the effectiveness of our approach, we conduct the following two analyses focusing on CMD. As a first analysis, instead of optimizing attention entropy in Attent by CMD, we simply minimize the attention entropy in the same way that Tent minimizes prediction entropy. This approach is described as "Attent -CMD" in Table 2. The result shows that this approach significantly deteriorates the performance. It indicates that attention entropy should be controlled by distribution matching rather than by minimization. As a second analysis, we optimize the classification entropy distribution in Tent using CMD so that it returns to that of the source dataset. This approach is described as "Tent + CMD" in Table 2. The result shows that the performance slightly deteriorates compared to the original Tent. It indicates that CMD is not appropriate for the classifier entropy.

Table 3 summarizes the adaptation results on ImageNet-C with severity from 1 to 5 for various backbone networks: ResNet50, ViT-B16, and ViT-L16. The ResNet50 result is included as a reference to show the performance difference between a regular ResNet model and ViT models. Following [16], Tent for ResNet50 updates the batch normalization parameters and reestimates the statistics. The results verify that both Tent and Attent consistently improve the performance across all the models for various degrees of corruption severity. This indicates that Tent and Attent are model agnostic. The experiment also demonstrates that Attent + Tent further boosts the performance regardless of network backbone and corruption severity.

Attention Map Reconstruction
We quantitatively analyze how close the attention map on a corrupted image is to that on the clean image, before and after adaptation. Specifically, for each sample, we measure the cross entropy of attention maps between a corrupted image $x^T$ in ImageNet-C and the corresponding clean image $x^{S \leftarrow T}$ in ImageNet (Eqs. 14 and 15), where $\theta_m$ is the set of parameters at the time of the $m$th batch during test-time adaptation. Note that $\theta_m = \bar{\theta}$ without adaptation (Source). Intuitively, Eq. 14 gives a higher penalty if the attention does not focus on correct locations, while Eq. 15 gives a higher penalty if the attention focuses on wrong locations. We combine the two into a single score (Eq. 16); a lower score of Eq. 16 indicates better attention map reconstruction. Table 5 summarizes the result of measuring the metric using images from ImageNet and ImageNet-C for 15 corruptions at the highest severity. The result demonstrates that Attent has the strongest tendency to reconstruct the attention map toward the original one. This tendency may contribute to the improvement in image classification accuracy. Tent also tends to reconstruct the attention map, but not as much as Attent, which implies that the performance improvement by Tent is related to other latent variables as well as to the attention map. The score of Attent + Tent lies between those of Tent and Attent. It can be assumed that the attention map and other latent variables are optimized at the same time. In the case of the snow corruption, Tent fails to adapt and its performance deteriorates substantially (see Table 4); at the same time, the attention map reconstruction score also deteriorates significantly (see Table 5), indicating the importance of the attention map.
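The intuition behind the two directions of cross entropy can be sketched on a single attention row. The code below is an assumption-laden illustration, not the exact metric of this study: it treats the clean-image attention as the reference in one direction and the corrupted-image attention as the reference in the other, and it combines the two by a plain sum. The function names and the small constant `eps` (added to avoid taking the logarithm of zero) are illustrative choices.

```python
import math

def cross_entropy(reference, estimate, eps=1e-12):
    """Cross entropy of `estimate` measured against the `reference` distribution."""
    return -sum(r * math.log(e + eps) for r, e in zip(reference, estimate))

def attention_reconstruction_score(clean_row, corrupted_row):
    """Sum of both cross-entropy directions between two attention rows (assumed combination)."""
    missed = cross_entropy(clean_row, corrupted_row)    # large if correct locations are ignored
    spurious = cross_entropy(corrupted_row, clean_row)  # large if wrong locations are attended
    return missed + spurious

same = attention_reconstruction_score([0.9, 0.1], [0.9, 0.1])
flipped = attention_reconstruction_score([0.9, 0.1], [0.1, 0.9])
```

In this toy case `flipped` is much larger than `same`, reflecting that attention placed on the wrong token is penalized in both directions.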

Hyperparameter Sensitivity
For online adaptation, hyperparameter tuning is a challenging issue. Figure 2 shows the sensitivity to each hyperparameter on ImageNet-C at the highest severity, averaged over 15 corruption types. We examine the following hyperparameters by changing one value at a time from the defaults described in Sect. 4.3: (a) learning rate, (b) batch size, (c) maximum order of central moments $K$ in Eq. 11, and (d) whether to enable gradient clipping for SGD optimization. $K = 1$ denotes first-order moment (mean) matching only, which ignores higher order moment matching, i.e., Eqs. 7 and 10.
The important finding is that Tent is more sensitive than Attent to some of the hyperparameters described above. In particular, enabling gradient clipping is essential for applying Tent to ViT models to avoid catastrophic failure of adaptation. Furthermore, a large learning rate also leads to catastrophic failure of Tent. In contrast, Attent is quite insensitive to each hyperparameter. This indicates that we can use Attent safely in an unknown environment with only rough hyperparameter tuning. It is also shown that increasing the order of central moments $K$ improves the performance of Attent, but the gain decreases as $K$ gets larger. This is consistent with the original CMD study [38], which states that the performance is similar when $K \geq 3$.

Conclusion and Future Work
This study proposed a novel method of test-time adaptation for ViT, called Attent, which adapts ViT by minimizing the distributional differences of the attention entropy between the source and target during test-time. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C show that Attent is effective on various ViT models. By combining Attent with other TTA methods, the robustness is further improved. As a limitation, Attent is not effective for some domain adaptation benchmarks, such as digit style shift, e.g., from SVHN to MNIST/MNIST-M. Future work includes improving our method so that it adapts well on these tasks. We hope that this study will encourage further research on test-time adaptation for ViT.

Table 5 The result of measuring the attention map reconstruction metric based on Eq. 16. The value indicates how close the attention map is between a clean image and the corresponding corrupted image. Images are taken from ImageNet and ImageNet-C for 15 corruptions at the highest severity. ViT-B16 is used as the model

A.1. TFA(-)
Test-time feature alignment (TFA) [29] aligns the hidden representation on target data by minimizing the distance of the mean vectors $\mu_s, \mu_t \in \mathbb{R}^D$ and the covariance matrices $\Sigma_s, \Sigma_t \in \mathbb{R}^{D \times D}$ between source and target. $D$ is the dimension of the hidden representation. We focus only on the "Online Feature Alignment" part of TTT++ [29]. The original TFA [29] aligns the distributions at both the hidden representation and the output of the self-supervised head. However, in our experiment, TFA(-) does not employ self-supervised learning, so we only focus on distribution matching of the hidden representation. Specifically, in this experiment, the hidden representation to align is defined as the one before the classifier head, $h(x) = f(x; \theta)$. The loss function is $\mathcal{L} = \lambda_1 \|\mu_s - \mu_t\|_2^2 + \lambda_2 \|\Sigma_s - \Sigma_t\|_F^2$, where $\|\cdot\|_2$ is the Euclidean norm and $\|\cdot\|_F$ is the Frobenius norm. $\mu$ and $\Sigma$ are the mean vector and covariance matrix, respectively. $\lambda_1$ and $\lambda_2$ are balancing hyperparameters. Like CFA [17], TFA(-) calculates the statistics on the source dataset and stores them in memory before adaptation. Note that the "Online Dynamic Queue" [29] is not used in TFA(-) in our experiment. Table 6 describes the preliminary experiment results of TFA(-) on the ImageNet-C datasets with severity = 5 when changing the balancing hyperparameters $\lambda_1, \lambda_2$. For the main experiment in Table 2, we use the default hyperparameters $\lambda_1 = 1, \lambda_2 = 1$ based on [29], as well as the best-performing hyperparameters $\lambda_1 = 1, \lambda_2 = 0$.

Fig. 2 The effect of sweeping hyperparameters on each method. The evaluation metric is top-1 error on ImageNet-C averaged over 15 corruption types at the highest severity. ViT-B16 is used as the model. One hyperparameter value at a time is changed from the defaults described in Sect. 4.3
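The TFA(-) loss described above can be sketched directly from its definition: a squared Euclidean distance between mean vectors plus a squared Frobenius distance between covariance matrices, each weighted by a balancing hyperparameter. The helper names are illustrative, and the covariance estimate below divides by the number of samples, which is an assumption since the normalization is not stated in the text.

```python
def mean_and_covariance(features):
    """Mean vector and covariance matrix (divided by n, an assumption) of feature vectors."""
    n = len(features)
    d = len(features[0])
    mu = [sum(f[j] for f in features) / n for j in range(d)]
    cov = [[sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in features) / n
            for j in range(d)] for i in range(d)]
    return mu, cov

def tfa_loss(mu_s, cov_s, mu_t, cov_t, lam1=1.0, lam2=1.0):
    """lam1 * ||mu_s - mu_t||_2^2 + lam2 * ||cov_s - cov_t||_F^2."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
    cov_term = sum((a - b) ** 2 for row_s, row_t in zip(cov_s, cov_t)
                   for a, b in zip(row_s, row_t))
    return lam1 * mean_term + lam2 * cov_term

mu_s, cov_s = mean_and_covariance([[0.0, 0.0], [2.0, 2.0]])   # stored source statistics
mu_t, cov_t = mean_and_covariance([[0.0, 0.0], [0.0, 0.0]])   # statistics of a target batch
full = tfa_loss(mu_s, cov_s, mu_t, cov_t, lam1=1.0, lam2=1.0)
mean_only = tfa_loss(mu_s, cov_s, mu_t, cov_t, lam1=1.0, lam2=0.0)
```

Setting `lam2=0.0` reduces the loss to mean matching only, which corresponds to the second hyperparameter setting mentioned in the text.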

A.2. T3A
T3A [31] updates only the classifier module by the centroid of each class averaged over the pseudo-labeled samples' feature vectors in an online manner. This is a gradient-free approach and there is no loss function. The hyperparameter filter size K is set to 100 in our experiment.
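The centroid-based update can be sketched as follows. This is a deliberately simplified illustration and not the full T3A procedure of [31]: it keeps, for each pseudo-label, the feature vectors with the lowest prediction entropy up to a filter size, and uses their average as the class centroid. The function names, the dictionary-based support set, and the use of prediction entropy as the filtering criterion are assumptions of this sketch.

```python
def update_support(support, feature, pseudo_label, entropy, filter_size=100):
    """Add one pseudo-labeled feature and keep the filter_size lowest-entropy entries per class."""
    entries = support.setdefault(pseudo_label, [])
    entries.append((entropy, feature))
    entries.sort(key=lambda entry: entry[0])  # lowest entropy (most confident) first
    del entries[filter_size:]
    return support

def class_centroids(support):
    """Average the stored feature vectors of each class."""
    return {
        label: [sum(f[d] for _, f in entries) / len(entries)
                for d in range(len(entries[0][1]))]
        for label, entries in support.items()
    }

support = {}
update_support(support, [1.0, 0.0], pseudo_label=0, entropy=0.2, filter_size=2)
update_support(support, [3.0, 0.0], pseudo_label=0, entropy=0.1, filter_size=2)
update_support(support, [100.0, 0.0], pseudo_label=0, entropy=0.9, filter_size=2)
centroids = class_centroids(support)
```

The third, high-entropy sample is filtered out, so it does not distort the centroid; no gradient is computed at any point.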

A.3. CFA
CFA [17] minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representation just before the classifier of a model between the source and target in an online manner. The hyperparameters of CFA are the same as the default described in [17]. Specifically, the balancing hyperparameter between the whole distribution loss and class-conditional loss is set as 1. The maximum central moments' order K is set as 3.

Data Availability Statement
The experiment code for this study is not publicly available. The datasets used for the experiments in this study are publicly available through the Internet.

Conflict of interest
The authors declare no conflicts of interest associated with this manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Table 6 Preliminary experiment results of TFA(-) on ImageNet-C when changing the balancing hyperparameters $\lambda_1, \lambda_2$. The evaluation metric is top-1 error on ImageNet-C averaged over 15 corruption types at severity level 5. ViT-B16 is used as the model