DynamicAug: Enhancing Transfer Learning through Dynamic Data Augmentation Strategies Based on Model State

Transfer learning has made significant advancements; however, over-fitting remains a major challenge. Data augmentation has emerged as a highly promising technique to counteract it. Current data augmentation methods are fixed in nature, requiring manual determination of the appropriate intensity prior to training, which entails substantial computational costs. Additionally, as the model approaches convergence, static data augmentation strategies can become suboptimal. In this paper, we introduce Dynamic Data Augmentation (DynamicAug), a method that autonomously adjusts the intensity of data augmentation according to the convergence state of the model. During each iteration of the model's forward pass, we use a Gaussian distribution based sampler to stochastically sample the current intensity of data augmentation. To ensure that the sampled intensity is aligned with the convergence state of the model, we introduce a learnable expectation to the sampler and update it iteratively. To assess the convergence state of the model, we introduce a novel loss function called the convergence loss. Through extensive experiments conducted over 27 vision datasets, we demonstrate that DynamicAug can significantly enhance the performance of existing transfer learning methods.


Introduction
Transfer learning has gained significant traction in the field of computer vision for a multitude of visual tasks. These tasks encompass image classification [1,2], image segmentation [3,4], and object detection [5,6]. By leveraging transfer learning, practitioners can capitalize on pre-existing models that have been trained on large-scale datasets [5,7,8,9]. This approach allows for the extraction of valuable features and knowledge, which can subsequently be transferred or fine-tuned to tackle specific tasks with limited available data. Consequently, transfer learning has emerged as a powerful technique, enabling enhanced performance and efficiency in various computer vision applications.
Fine-tuning pre-trained models has proven to be resource-efficient and can yield commendable performance. However, fine-tuning the entire model is often impractical [10]: it demands substantial computational resources and time, which can be prohibitive in real-world scenarios. To address this challenge, researchers have explored strategies such as transfer learning with partial fine-tuning [10][11][12][13] or selective layer freezing. These techniques update only specific layers or parts of the model while keeping other components frozen. By doing so, the computational burden is significantly reduced, and the fine-tuning process becomes more feasible. These approaches strike a balance between leveraging pre-trained knowledge and adapting the model to the target task with limited updates. They offer a practical compromise that uses resources efficiently while still achieving satisfactory performance in various computer vision tasks.
Nevertheless, the datasets employed in transfer learning are often limited, particularly in size or scope [14]. Consequently, models fine-tuned on such datasets are susceptible to either overfitting or undertraining. Several mitigation strategies have been proposed, among which data augmentation has emerged as a particularly effective technique. Data augmentation applies a variety of transformations to the input images during model training to increase the diversity of the training data. By exposing the model to a broader range of underlying data, data augmentation enhances the model's capacity for generalization and improves its robustness to previously unseen inputs. Commonly employed techniques for data augmentation include mixup-based methods [15,16], RandAugment [17], and others. It is worth noting that the Drop method [18] can also be used as an implicit form of data augmentation to combat overfitting.
Current fine-tuning paradigms require manual determination of the strength of these data augmentation hyperparameters prior to training [13,19]. However, the fine-tuning process is highly sensitive to the hyperparameters of data augmentation. To achieve optimal fine-tuning performance, a considerable number of data augmentation hyperparameters need to be searched (e.g., the drop ratio in Drop and the mixup ratio in mixup). Manually tuning these hyperparameters is inefficient and requires significant resource expenditure. Additionally, the hyperparameters determined via manual search are typically static, meaning that they do not change during the whole fine-tuning process. The effectiveness of static hyperparameters for model fine-tuning is debatable, since dynamic learning rates have been shown to be superior to static ones.
To enhance the thoroughness of the model fine-tuning process and mitigate overfitting, we investigate the impact of dynamic augmentation in fine-tuning tasks and propose a dynamic model-aware augmentation strategy named DynamicAug, which adaptively adjusts the data augmentation intensity based on the model's convergence state. Unlike traditional static augmentation techniques, dynamic augmentation methods adjust the augmentation strategy during training based on real-time feedback and the current state of the model. The intuition behind DynamicAug originates from the observation that the model exhibits varying levels of convergence at different training stages. Factors such as diverse datasets, model variations, and training strategies can significantly influence the convergence states a model passes through during training. Consequently, different levels of regularization are required for each degree of convergence. Hence, it is essential to dynamically adjust the regularization level based on the model's convergence state.
To achieve this, we implement a criterion that dynamically adjusts the augmentation strategy every iteration by introducing a novel loss function that evaluates the convergence state of the model during training, and we keep track of the adjusted augmentation intensity for future retraining. In addition, to ensure the exploratory nature of the learnable augmentation strategy, the hyperparameters required for data augmentation are all sampled from the Gaussian Distribution based Data Augmentation Sampler (DAS). We use DAS to obtain dynamic data augmentation strategies. The entire training process is end-to-end, without additional computational overhead.
For the specific data augmentation strategies, we primarily focus on the dynamic implementation of commonly employed static data augmentation techniques, including mixup-based methods, RandAugment, and Drop. The intensity of each of these three methods is controlled by a specific parameter. For the mixup-based methods, the mixing ratio is modified to regulate the intensity. For RandAugment, the magnitude is adjusted to control the intensity. Lastly, for the Drop method, the drop ratio in each Transformer block is manipulated to regulate the intensity. By dynamically modifying the augmentation intensity, these methods are effective in addressing challenges related to overfitting and limited training data, thereby facilitating optimal model training.
We evaluate DynamicAug on 27 classification datasets [20][21][22][23][24] in total and conduct extensive experiments on transfer learning tasks, including different pre-trained weights, different augmentation methods, and different transfer learning models. In addition, DynamicAug requires few additional parameters and is easily extended to various model families. Fig. 1 illustrates the difference between DynamicAug and fixed data augmentation methods. Furthermore, we present extensive experimental results to assess its overall effectiveness. It is noteworthy that, once optimized through dynamic data augmentation, the traditional fine-tuning approach can outperform the current state-of-the-art (SOTA) fine-tuning method SPT [25]. In summary, our contributions are as follows:
1. We propose a dynamic model-aware augmentation approach. This approach dynamically and adaptively adjusts the data augmentation intensity based on the model's convergence state. A model fine-tuned with dynamic data augmentation demonstrates superior performance compared to a model fine-tuned with static data augmentation.
2. The proposed DynamicAug method is not restricted to the modified LoRA [26], Adapter [27] and VPT [10] methods. It can be seamlessly combined with the mainstream fine-tuning methods.
3. Our experiments validate that DynamicAug is a valuable supplement to current fine-tuning strategies and significantly improves model performance. For instance, DynamicAug improves the LoRA fine-tuning method by an average accuracy increase of +1.7% on VTAB-1k, which even surpasses the best fine-tuning architecture searched in NOAH [13] and SPT [25].
Related work

Data augmentation
Data augmentation is a technique that artificially generates additional data while preserving the original data distribution within the training set. Commonly employed techniques for data augmentation include mixup-based methods [15,16] and RandAugment [17], among others. Moreover, the Drop method [18,28] can also be regarded as an effective data augmentation strategy.
Drop: Methods like Drop improve model performance or robustness by randomly deactivating parts of the model's structure during training. For instance, in the dropout method [18], certain neurons stop working with a specific probability during the forward propagation process in the fully connected layer. This process significantly improves the model's generalization and reduces its sensitivity to local features. Likewise, in the drop path method [28], random paths in the multi-branch network structure are deactivated, removing the dependence of weight updates on the joint action of a fixed relationship between the branches. Typically, these methods are included during model training as a means of regularization to prevent the model from overfitting.
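As an illustration of the drop path idea described above, the following is a minimal NumPy sketch, not the implementation used in this work. The function name `drop_path`, the tensor layout (batch, tokens, dim), and the rescaling by the keep probability are assumptions made for the example.

```python
import numpy as np

def drop_path(x, drop_ratio, rng, training=True):
    """Zero the whole residual branch of randomly chosen samples (hypothetical helper)."""
    assert 0.0 <= drop_ratio < 1.0
    if not training or drop_ratio == 0.0:
        return x
    keep_prob = 1.0 - drop_ratio
    # One Bernoulli draw per sample, broadcast over the remaining dimensions.
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = rng.random(mask_shape) < keep_prob
    # Rescale the kept samples so the expected output matches the input.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
branch = np.ones((4, 3, 8))  # (batch, tokens, dim), illustrative sizes only
out = drop_path(branch, drop_ratio=0.5, rng=rng)
```

Each sample's branch is either dropped entirely or kept and rescaled, and at evaluation time the input passes through unchanged.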
RandAugment: RandAugment [17] is an improvement upon AutoAugment [29] that keeps the probabilities of all image processing operations fixed and varies only the number and intensity of the operations applied. These operations mainly include 14 different transformations such as identity and autocontrast, among others. Notably, the RandAugment algorithm has three main hyperparameters: the magnitude m, the standard deviation of the magnitude M, and the number of transformations N. Increasing m and N enhances the intensity of data augmentation, while the standard deviation of the magnitude is often set to 0.5.
CutMix and Mixup: Mixup [15] and CutMix [16] are data augmentation methods that fuse different parts of images to generate new training samples. The Mixup method randomly selects two samples and performs a linear weighted summation. The CutMix technique merges images in a region-based manner by cutting a part of one image and pasting it onto another. New samples generated using the Mixup and CutMix methods help compensate for a lack of training images and expand the training data to some extent. Given two input images $x_A$ and $x_B$ with their corresponding labels $y_A$ and $y_B$, Mixup generates a new mixed image $\tilde{x}$ using a linear combination of the pixel values of A and B. This can be described as
$$\tilde{x} = \lambda x_A + (1 - \lambda) x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B, \tag{1}$$
where $\lambda$ obeys the Beta distribution with the parameter 0.8. For CutMix, the processing can be formulated as
$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B, \tag{2}$$
where $M \in \{0, 1\}^{W \times H}$ represents a binary mask associated with $\lambda$, and W and H denote the width and height of the input images, respectively. Here $\lambda$ obeys the Beta distribution with the parameter 1.0. These techniques have proven to be highly effective at incorporating richer variations in the training data, while simultaneously preserving local spatial information.
Overall, data augmentation aims to enable the model to learn the characteristics of the data in a more comprehensive, multi-faceted manner instead of relying on one-sided information.
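The two mixing rules above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the code used in the experiments: images are assumed to be (H, W, C) arrays, labels are one-hot vectors, and the CutMix label weight is recomputed from the pasted area after the patch is clipped at the image border.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Linear blend of two images and of their one-hot labels."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

def cutmix(x_a, y_a, x_b, y_b, lam, rng):
    """Paste a rectangular patch of x_b into x_a; labels follow the area ratio."""
    h, w = x_a.shape[:2]
    # Patch sides chosen so the patch covers roughly (1 - lam) of the image.
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = np.ones((h, w, 1))  # 1 keeps x_a, 0 takes x_b
    mask[top:bottom, left:right] = 0.0
    x = mask * x_a + (1.0 - mask) * x_b
    lam_adj = mask.mean()  # actual share of pixels kept from x_a
    y = lam_adj * y_a + (1.0 - lam_adj) * y_b
    return x, y

rng = np.random.default_rng(0)
x_a, x_b = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lam_m = rng.beta(0.8, 0.8)  # Mixup ratio
lam_c = rng.beta(1.0, 1.0)  # CutMix ratio
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam_m)
x_cut, y_cut = cutmix(x_a, y_a, x_b, y_b, lam_c, rng)
```

In both cases the label weights sum to one, and a smaller ratio moves the sample further from the first image, which is why the ratio acts as the intensity control.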

Transfer learning
Pre-training followed by fine-tuning is the most frequently used transfer learning approach [10,26,27,30,31,32]. Pre-training and fine-tuning approaches in computer vision have consistently demonstrated remarkable empirical performance across various tasks. While fine-tuning pre-trained models offers notable efficiency and performance advantages, fine-tuning the entire model remains impractical in many scenarios. Hence, an alternative approach replaces full fine-tuning with the selective tuning of only a few trainable parameters, while keeping the majority of parameters, which are shared across multiple tasks, frozen. This approach typically fine-tunes less than 1% of the parameters of the complete model, yet its overall performance can be equivalent to or surpass that achieved through full fine-tuning. By significantly reducing the storage burden and making optimization less challenging, tuning fewer parameters makes adapting pre-trained models to target datasets more feasible. The current mainstream parameter-efficient tuning methods can be broadly categorized into two groups, namely addition-based parameter tuning methods [10,27,30] and reparameterization-based tuning methods [33][34][35].
In the addition-based parameter-efficient tuning approach, extra trainable parameters are appended to the backbone model, and only these supplementary parameters are fine-tuned during the adaptation process. The existing parameters of the backbone model remain fixed throughout. Two notable examples of addition-based parameter-efficient tuning methods are the Visual Prompt Tuning (VPT) algorithm [10] and the Adapter algorithm [27]. VPT places a group of learnable tokens at the beginning of the input sequence for each Transformer block. The input of VPT can be represented by $x_{vpt} = [x_{cls}, P]$, where $x_{cls}$ represents the class token and $P = \{p_k \mid p_k \in \mathbb{R}^D, k = 1, \dots, m\}$ represents the m added learnable tokens. D is the dimension of each token. Adapter is a lightweight network placed after the MLP layers in the Transformer, and it can be formulated as
$$\mathrm{Adapter}(x) = \mathrm{Linear}_{up}\big(\mathrm{QuickGELU}(\mathrm{Linear}_{down}(x))\big), \tag{3}$$
where $\mathrm{Linear}_{up}$ and $\mathrm{Linear}_{down}$ are constructed with a small number of parameters, and QuickGELU is the activation function.
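The Adapter bottleneck described above can be sketched as follows. This is a minimal illustration, assuming bias-free linear layers, a bottleneck width `r`, and zero initialization of the up-projection; none of these details are specified in the text. QuickGELU is taken to be x · sigmoid(1.702 x), its usual definition.

```python
import numpy as np

def quick_gelu(x):
    # QuickGELU: x * sigmoid(1.702 * x)
    return x / (1.0 + np.exp(-1.702 * x))

def adapter(x, w_down, w_up):
    """Project D -> r, apply QuickGELU, project r -> D (biases omitted)."""
    return quick_gelu(x @ w_down) @ w_up

rng = np.random.default_rng(0)
D, r = 16, 4  # illustrative sizes only
x = rng.normal(size=(5, D))
w_down = rng.normal(size=(D, r)) * 0.02
w_up = np.zeros((r, D))  # assumed zero init: the adapter starts as a no-op
out = adapter(x, w_down, w_up)
```

Only `w_down` and `w_up` would be trained, which amounts to 2·D·r values per adapter, far fewer than the backbone's parameters.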
Fig. 2 Overview of our DynamicAug method. On the basis of existing transfer learning methods, we dynamically update the intensity of data augmentation for fine-tuning. We first replace the conventional data augmentation methods with dynamic paradigms and incorporate additional parameters to update the intensity of data augmentation, so that α is a learnable hyperparameter embedded in the model. To maintain the exploratory nature of α, we introduce the Gaussian Distribution based Data Augmentation Sampler (DAS) and use the augmentation policies sampled from DAS for fine-tuning. Finally, we evaluate the convergence state of the model with a newly introduced loss, the convergence loss in Eq. 9, and then update α.

The reparameterization-based methods adjust parameters that are either inherent to the backbone or can be reparameterized into it during inference. For example, LoRA [26] adds an additional parameter $\Delta W$ for Q and K in each Transformer block, where $\Delta W$ is composed of a matrix $A_{q/k} \in \mathbb{R}^{r \times D}$ for ascending dimensions and a matrix $B_{q/k} \in \mathbb{R}^{D \times r}$ for descending dimensions. r is the down-projection dimension (generally set to 8 or 16). During training, only A and B require iterative updates, while the other parameters of the model remain frozen. The forward process of Q and K can be formulated as
$$Q/K = W_{q/k}\, x + \Delta W_{q/k}\, x. \tag{4}$$
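The LoRA forward pass for Q and K can be sketched as below. The sketch assumes a row-vector convention (tokens as rows), no scaling factor, and a zero-initialized A; these are assumptions for illustration, not details given in the text. Shapes follow the definitions above: B maps D to r and A maps r back to D.

```python
import numpy as np

def lora_projection(x, w, a, b):
    """x: (N, D) tokens, w: (D, D) frozen weight, a: (r, D), b: (D, r)."""
    down = x @ b   # (N, r): B reduces the dimension from D to r
    up = down @ a  # (N, D): A restores the dimension from r to D
    return x @ w + up

rng = np.random.default_rng(0)
n, d, r = 3, 16, 8  # illustrative sizes only
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, d))
b = rng.normal(size=(d, r))
a_zero = np.zeros((r, d))  # assumed init: the update starts at zero
a_rand = rng.normal(size=(r, d))
q_init = lora_projection(x, w, a_zero, b)
q_tuned = lora_projection(x, w, a_rand, b)
```

Because the low-rank product can be folded into the frozen weight as `w + b @ a`, the tuned model has the same inference cost as the original one.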

Approach
In this section, we begin by introducing DynamicAug, a dynamic optimization method that extends the traditional augmentation approach with minimal additional parameters. We show an overview of our method in Fig. 2. To preserve the exploratory nature of the learnable hyperparameters, we incorporate a Gaussian Distribution based Data Augmentation Sampler. We use this sampler to sample an augmentation strategy, which is then used for fine-tuning. Finally, to update the augmentation strategy based on a reasonable evaluation of the model's convergence state, we propose a novel loss function referred to as the convergence loss.

Dynamic data augmentation
Models can exhibit distinct convergence states and therefore require appropriate training strategies. Fixed strategies used in previous studies may lead to insufficient training or overfitting under certain conditions [26,27]. Conversely, a suitable strategy can aid the training process. For instance, studies have demonstrated that, in certain scenarios, applying the drop method only during the initial stages of training fits the data better than using it throughout the entire process [36].
To comprehensively tackle this issue, we introduce the adaptive data augmentation method called DynamicAug. The purpose of DynamicAug is to determine the intensity of data augmentation during fine-tuning.
For Drop. In the traditional drop method, the forward pass of each Transformer block (Eq. 5) maps the input $x_i$ of the i-th ViT block to its output $y_i$. The drop ratio of the corresponding block is denoted by $r_i$, which increases or decreases as the depth i increases, as in [37]. In DynamicAug, by contrast, we establish parameters for $r_i$ and update them within each Transformer block during the iterations. Eq. 5 is rewritten with $r_i$ replaced by $p_d^i$, a dynamic drop ratio that is updated as the model improves. We add a learnable hyperparameter $\alpha_d^i$ for each $p_d^i$. During fine-tuning, we sample $p_d^i$ from $\alpha_d^i$ before each iteration through a truncated normal sampler, and continually update $\alpha_d^i$ according to the model's convergence state.
For Mixup and CutMix. The conventional mixup-based methods mix the values and labels of two images using a linear ratio λ. The prevalent training paradigm alternates between the Mixup and CutMix methods, with a default probability of 0.5 for switching between them. Regardless of which is applied, the parameter λ governs the intensity of data augmentation. Consequently, during the fine-tuning process, we dynamically adjust the value of λ with $[p_m, p_c]$. To learn its optimal value based on the convergence state, we add two parameters $[\alpha_m, \alpha_c]$ for $[p_m, p_c]$.
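For the Drop case described above, one form consistent with the stated definitions can be written as follows. This is a sketch under assumptions: $F_i$ (the residual branch of the i-th block), the operator name $\mathrm{Drop}$, and the sampler symbol $\Psi$ with its single argument are notation introduced here for illustration and may differ from the notation of Eq. 5.

```latex
% Static drop: a fixed, depth-dependent ratio r_i (F_i is an assumed symbol).
y_i = x_i + \mathrm{Drop}\big(F_i(x_i);\, r_i\big)

% Dynamic drop: the ratio p_d^i is re-sampled before each iteration
% from a truncated normal sampler centred on the learnable \alpha_d^i.
y_i = x_i + \mathrm{Drop}\big(F_i(x_i);\, p_d^i\big),
\qquad p_d^i \sim \Psi\big(\alpha_d^i\big)
```

The only change between the two lines is the source of the ratio: a preset schedule in the static case and a sample around a learned expectation in the dynamic case.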
For RandAugment. During the implementation of RandAugment, a predetermined set of transformation operations is randomly selected. Subsequently, these operations are applied to the samples in accordance with preset hyperparameters as
$$\tilde{x} = f_{p_1,m}\big(\dots\big(f_{p_n,m}(x)\big)\dots\big), \tag{7}$$
where m denotes the magnitude of the transformation, n is the number of transformations applied, and p represents the specific transform operation.
In DynamicAug, we dynamically update the value of m with $p_r$ and sample $p_r$ from $\alpha_r$ to control the intensity of data augmentation. Hence, RandAugment requires only one parameter to make fine-tuning dynamic.
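The composition in Eq. 7 can be illustrated with a toy sketch. The three operations below (`identity`, `brightness`, `contrast`) are hypothetical stand-ins for the real RandAugment operation set, images are assumed to be arrays with values in [0, 1], and the magnitude m is assumed to be normalized to [0, 1].

```python
import numpy as np

def identity(x, m):
    return x

def brightness(x, m):
    # Shift pixel values upwards in proportion to the magnitude.
    return np.clip(x + 0.5 * m, 0.0, 1.0)

def contrast(x, m):
    # Stretch pixel values around the image mean in proportion to the magnitude.
    mean = x.mean()
    return np.clip((x - mean) * (1.0 + m) + mean, 0.0, 1.0)

OPS = [identity, brightness, contrast]  # toy operation set (assumption)

def rand_augment(x, n, m, rng):
    """Apply n randomly chosen operations in sequence, all with magnitude m."""
    for idx in rng.integers(len(OPS), size=n):
        x = OPS[idx](x, m)
    return x

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
out = rand_augment(img, n=2, m=0.3, rng=rng)
unchanged = rand_augment(img, n=3, m=0.0, rng=rng)
```

With m = 0 every toy operation reduces to the identity, so m alone scales how far the output departs from the input, which is the quantity a dynamic scheme would adjust.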

Gaussian Distribution based Data Augmentation Sampler
The Gaussian Distribution based Data Augmentation Sampler (DAS) serves the purpose of introducing noise and enhancing the exploratory nature of the data augmentation hyperparameters. Our preference is for the model to capture the trend of changes in the augmentation strategy rather than specific hyperparameter values.
In DynamicAug, we therefore randomly sample the data augmentation strategies p from the DAS Ψ with the learnable expectation α for fine-tuning in real time. The DAS can be formulated as
$$p_i \sim \Psi(\alpha_i, \sigma^2, a_i, b_i), \tag{8}$$
where $\sigma^2$ is a fixed variance, and $a_i$ and $b_i$ refer to the left and right borders of the truncated distribution, respectively. For the strategies p, we have $p = [p^1_{drop}, \dots, p^i_{drop}]$ for the dynamic drop strategies in Transformer models with depth i, $p = [p_m, p_c]$ for the dynamic mixup-based methods, and $p = [p_r]$ for the dynamic RandAugment method. DynamicAug thus introduces very few additional parameters. To update α end to end, we use the straight-through estimator (STE) [38] to incorporate α into the forward process. Details can be seen in Algorithm 1.
Mixup-based methods and RandAugment share a similar characteristic: they operate before the samples enter the model. In contrast, Drop acts throughout the forward process of the model. Therefore, the ways in which their parameters are updated also differ. For Drop, we incorporate the corresponding parameter $\alpha_i$ in each block during the forward process. For the other two methods, α is multiplied at the end of the model's forward process to assess the state of the model.
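A sampler of this kind can be sketched as follows. This is a minimal illustration, assuming rejection sampling for the truncation and a clipping fallback; the function names and the numeric values of σ and the borders are hypothetical. The second function shows the straight-through idea in its simplest form: the gradient with respect to the sampled value is passed unchanged to the expectation α.

```python
import numpy as np

def sample_truncated_normal(alpha, sigma, low, high, rng, max_tries=1000):
    """Draw one value per entry of alpha from N(alpha, sigma^2) restricted to [low, high]."""
    alpha = np.atleast_1d(np.asarray(alpha, dtype=float))
    p = np.empty_like(alpha)
    for i, mean in enumerate(alpha):
        for _ in range(max_tries):
            draw = rng.normal(mean, sigma)
            if low <= draw <= high:
                p[i] = draw
                break
        else:
            # Fallback when the mean lies far outside the interval.
            p[i] = np.clip(mean, low, high)
    return p

def straight_through_grad(grad_wrt_p):
    # Straight-through estimator: treat dp/dalpha as 1 in the backward pass.
    return grad_wrt_p

rng = np.random.default_rng(0)
alpha = np.full(12, 0.1)  # e.g. one expectation per block (illustrative)
p = sample_truncated_normal(alpha, sigma=0.05, low=0.0, high=1.0, rng=rng)
many = sample_truncated_normal(np.full(2000, 0.3), sigma=0.1, low=0.0, high=1.0, rng=rng)
```

The samples scatter around α while staying inside the allowed range, so the training sees noisy intensities whose centre is the learnable quantity.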
To provide further evidence, we conducted ablation studies comparing training with and without DAS. The results demonstrate a significant decrease in model performance when the sampler is not used. Moreover, recent studies have also employed samplers to obtain corresponding strategies [39].
Since α is a hyperparameter determined by the convergence behavior of the model, its optimization process differs from that of the model parameters w. Therefore, we introduce a new loss function specifically for optimizing α.

Loss for model convergence state
Since obtaining the model's convergence state directly on the evaluation set remains challenging, we propose an alternative approach that simulates the convergence state and uses it to update α. Our approach monitors the change in model loss and reserves a designated number of samples from the training set to construct a validation set that emulates the evaluation data. Specifically, we introduce a novel loss function, referred to as the convergence loss (Eq. 9), which serves as the optimization objective for α. In this loss function, $L_{val}$ represents the cross-entropy loss and w represents the common model weights. During actual training, we can estimate the change in loss by measuring the difference between iterations, so Eq. 9 can be approximated by the difference in loss over f iterations, where f is the update frequency of α and t is the current training iteration. Generally, a larger negative incremental loss indicates a greater potential for improvement in model performance, suggesting that the model has not yet converged. Our ultimate optimization goal for α is defined on this incremental loss. To update α, we obtain its gradient $\nabla_\alpha$ by storing the gradient at one update and then again f iterations later.
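The finite-difference estimate described above can be sketched as follows. The function name `convergence_signal`, the division by f, and the example loss values are assumptions made for illustration; the text only states that the change in validation loss is measured between iterations f steps apart.

```python
import math

def convergence_signal(val_losses, t, f):
    """Estimate the change in validation loss per iteration over the last f iterations."""
    if f <= 0 or t - f < 0:
        raise ValueError("need f > 0 and at least f earlier iterations")
    return (val_losses[t] - val_losses[t - f]) / f

# Hypothetical validation losses recorded at consecutive iterations.
losses = [1.00, 0.80, 0.70, 0.65, 0.64, 0.64]
early = convergence_signal(losses, t=2, f=2)  # loss still falling quickly
late = convergence_signal(losses, t=5, f=2)   # loss has almost flattened
```

A strongly negative value, as in the early phase, indicates that the model is still improving, while a value near zero indicates that it is close to convergence.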
During the update of α, we explicitly set the drop ratio to 0, or turn off the Mixup and RandAugment methods, in order to assess the model's performance in the evaluation state. On the held-out validation split, α takes part in the forward process of each block, and its gradient is maintained through the straight-through estimator (STE) during the backward process. The pseudocode of the α update is given in Algorithm 1. In the implementation, the augmentation parameter α is updated only every few iterations, minimizing any potential computational overhead.
It is important to recognize that while the augmentation parameters may not exert a noticeable impact on the loss during individual training iterations, they do have a subtle influence on the overall training process over an extended period of time.
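One way to read the periodic update described above is sketched below. This is an assumption-laden illustration, not Algorithm 1: the class `AlphaUpdater`, the plain gradient step with learning rate `lr`, and the callback `val_grad_fn` (which would return the validation gradient with respect to α, measured with augmentation turned off) are all hypothetical.

```python
import numpy as np

class AlphaUpdater:
    """Updates alpha every `update_every` iterations from a difference of stored gradients."""

    def __init__(self, alpha, lr, update_every):
        self.alpha = np.asarray(alpha, dtype=float).copy()
        self.lr = lr
        self.update_every = update_every
        self._prev_grad = None

    def step(self, iteration, val_grad_fn):
        # Skip iterations that are not on the update schedule.
        if iteration % self.update_every != 0:
            return False
        grad = np.asarray(val_grad_fn(self.alpha), dtype=float)
        updated = False
        if self._prev_grad is not None:
            # Difference between the current and the stored gradient, per iteration.
            conv_grad = (grad - self._prev_grad) / self.update_every
            self.alpha -= self.lr * conv_grad
            updated = True
        self._prev_grad = grad
        return updated

# Hypothetical validation gradients returned at iterations 0 and 5.
grads = iter([np.array([1.0]), np.array([0.5])])
updater = AlphaUpdater(alpha=[0.1], lr=1.0, update_every=5)
flags = [updater.step(t, lambda a: next(grads)) for t in range(6)]
```

The first scheduled call only stores a gradient; the actual update happens at the next scheduled call, so the cost is one extra validation pass every few iterations.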

Experiments
In this section, we begin by outlining our experimental setup, which encompasses the datasets, baseline methods, and implementation details employed in our study. We then showcase the effectiveness of DynamicAug across multiple mainstream transfer learning tasks. Furthermore, we conduct ablation experiments to evaluate the impact and efficacy of DynamicAug in comparison to other approaches. Lastly, we present in-depth analyses to better understand the role of augmentations in transfer learning tasks.

Experimental Settings
Datasets. Our experiments primarily use three distinct types of datasets. 1) We employ the VTAB-1k [24] dataset, which serves as a benchmark for transfer learning in visual classification tasks. This dataset includes 19 classification tasks that are categorized into three domains: i) natural images captured by standard cameras; ii) specialized images captured by non-standard cameras, such as remote sensing and medical cameras; iii) structured images synthesized from simulated environments. The benchmark consists of various tasks, including object counting and depth estimation, from diverse image domains. With only 1,000 training samples per task and a high level of complexity, the benchmark is extremely challenging. 2) Another benchmark we use is Fine-Grained Visual Classification (FGVC), which focuses on fine-grained visual classification tasks. This benchmark comprises several datasets, including Stanford Dogs [20], Oxford Flowers [21], NABirds [22], CUB-200-2011 [23] and Stanford Cars [40]. Each FGVC dataset consists of 55 to 200 categories and several thousand images for training, validation, and testing. In cases where validation sets are not provided, we follow the validation split specified in [13]. 3) For the few-shot tasks, we select five fine-grained visual recognition datasets, namely Food101 [41], OxfordFlowers102 [42], StanfordCars [43], OxfordPets [44], and FGVCAircraft [45]. These datasets comprise categories that depict a diverse range of visual concepts closely associated with our daily lives, such as food, plants, vehicles, and animals. To assess the effectiveness of our approach, we follow previous studies [32,46] and evaluate the performance on 1, 2, 4, 8, and 16 shots, which are adequate for observing the trend.
Baselines. Following [25], our main experiments are conducted using a Vision Transformer backbone, ViT-B/16, pre-trained on ImageNet-21K. We incorporate the DynamicAug method into LoRA [26], Adapter [27] and Prompt-deep [10] to further enhance their performance. During the fine-tuning process, these three methods exclusively use the Drop data augmentation technique. To ensure a fair comparison, we conduct separate fine-tuning runs with the Mixup and RandAugment methods, thereby establishing baselines specific to these two data augmentation techniques. Furthermore, we perform additional experiments on the ViT-L/16 backbone to demonstrate the effectiveness of DynamicAug on larger models. In terms of training strategies, we explore both supervised pre-training and self-supervised pre-training. Specifically, we use backbones pre-trained with MAE [47] and MoCo v3 [48]. Finally, to assess the generalizability of DynamicAug, we incorporate it into the CLIP language-visual model [49,50] in the appendix. This evaluation allows us to examine the applicability of DynamicAug beyond purely visual tasks, extending it to language-visual tasks as well.
Implementation Details. Following [13], we use the AdamW optimizer [51] with cosine learning rate decay for our experiments. Specifically, for the VTAB experiments, we set the batch size to 64, the learning rate to $1 \times 10^{-3}$, and the weight decay to $1 \times 10^{-4}$. To ensure fairness, we follow the standard data augmentation pipeline [13].
To update the augmentation parameters, we set the share of training samples held out for validation to a fixed ratio, such as 0.1, 0.2, or 0.4. The augmentation strategy is updated every 5 iterations, or, when fewer iterations remain once the images have been split, at the maximum number of iterations available. To optimize the performance of DynamicAug, we conduct a grid search over the hyperparameters involved. Importantly, DynamicAug only operates during training and is turned off during the testing and validation stages. Therefore, it does not increase the model's inference time, ensuring efficient and practical implementation.
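The hold-out split and the update period can be sketched as follows. Both helpers are hypothetical, and the rule in `alpha_update_period` (use 5 iterations unless an epoch contains fewer) is one reading of the update schedule described above, stated here as an assumption.

```python
import numpy as np

def split_train_val(num_samples, val_ratio, rng):
    """Randomly reserve a fraction of the training indices as a validation split."""
    idx = rng.permutation(num_samples)
    n_val = int(round(num_samples * val_ratio))
    return idx[n_val:], idx[:n_val]

def alpha_update_period(num_train, batch_size, default_period=5):
    """Update every `default_period` iterations, capped by the iterations in one epoch."""
    iters_per_epoch = max(num_train // batch_size, 1)
    return min(default_period, iters_per_epoch)

rng = np.random.default_rng(0)
train_idx, val_idx = split_train_val(1000, val_ratio=0.2, rng=rng)
period_vtab = alpha_update_period(len(train_idx), batch_size=64)  # 12 iterations per epoch
period_few_shot = alpha_update_period(100, batch_size=64)         # only 1 iteration per epoch
```

With 1,000 samples and a ratio of 0.2, 800 samples remain for training, which at batch size 64 gives enough iterations for the default period; a very small few-shot set falls back to updating every iteration.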
Remarkably, LoRA-DD outperforms the SOTA methods NOAH [13] and SPT [25] with just 12 trainable augmentation parameters (equal to the depth of ViT-B/16) in Tab. 2, which suggests that the training strategy can sometimes affect model performance more than the architecture does.
Notably, previous efficient fine-tuning methods have shown inconsistent results when applied to backbones with different pre-training strategies. To validate the effectiveness of DynamicAug under different pre-training strategies, we conducted experiments on MAE and MoCo v3 pre-trained backbones. The results are shown in Tab. 3.

Table 1 Comparisons between the traditional data augmentation methods and DynamicAug on VTAB-1k [24] benchmarks using supervised pre-trained ViT-B/16 backbone pre-trained on ImageNet-21k. The abbreviations FD, FM, and FR correspond to Fixed Drop, Fixed Mixup, and Fixed RandAugment, respectively. Similarly, DD, DM, and DR represent Dynamic Drop, Dynamic Mixup, and Dynamic RandAugment, respectively. "Total params" denotes the ratio of the total number of parameters needed for all downstream tasks relative to the one for the pre-trained backbone, and "Tuned/Total" denotes the fraction of trainable parameters. Top-1 accuracy (%) is reported.

It should be noted that the 19 VTAB-1k datasets are very small, comprising only 800 to 1,000 samples per dataset. Dividing the dataset to evaluate the drop policy can have a considerable negative impact on the model's performance. Consequently, after obtaining the drop strategy, we retrain the model using the full dataset.
Experiments on FGVC. We conducted experiments on five fine-grained datasets for FGVC. As shown in Tab. 2, LoRA-DD outperforms the LoRA baseline by a clear margin of 1.2% mean top-1 accuracy. The results show that DynamicAug also performs well on fine-grained tasks.
Experiments on Few-Shot Transfer Learning. Because the few-shot datasets are so small, we held out a proportion of (0.1, 0.2, 0.4) of each dataset for evaluation during the few-shot experiments. As depicted in Fig. 3, after applying the DynamicAug method (blue line), the average accuracy of LoRA improved significantly compared to using static drop (orange line). Furthermore, the average results are on par with the current SOTA method, NOAH (green line), and surpass it in the low-data regime of 1-shot, 2-shot, and 4-shot. Notably, the performance of LoRA-DD on the FGVCAircraft dataset is outstanding. Although the other datasets did not display the same level of improvement, LoRA-DD outperformed LoRA with static drop on most of them.

Table 2 Comparisons on FGVC and VTAB-1k [24] benchmarks using supervised pre-trained ViT-B/16 backbone pre-trained on ImageNet-21k. "Total params" denotes the ratio of the total number of parameters needed for all downstream tasks relative to the one for the pre-trained backbone, and "Tuned/Total" denotes the fraction of trainable parameters. Top-1 accuracy (%) is reported. The best result is in bold.

Table 3 Comparisons on VTAB-1k [24] benchmark using self-supervised ViT-B/16 backbone pre-trained by MAE [47] and MoCo v3 [48]. "Total params" denotes the ratio of the total number of parameters needed for all downstream tasks relative to the one for the pre-trained backbone, and "Tuned/Total" denotes the fraction of trainable parameters. Top-1 accuracy (%) is reported. The best result is in bold.

Ablation Study
Effect of the DynamicAug. To assess the effectiveness of DynamicAug on VTAB-1k, we used different static drop rates as baselines and compared the dynamic drop strategy with the static drop strategy. The findings are presented in Tab. 4.
On the original baseline with a 0.1 drop rate, the DynamicAug method increased accuracy by 1.5%, 0.9%, and 2.8% in the Natural, Specialized, and Structured groups, respectively.

Effect of Gaussian Distribution based Data Augmentation Sampler. As mentioned previously, after the initial training with the incomplete dataset, it is necessary to retrain using the complete dataset. However, the number of strategies obtained through the sampler during training differs from the number of strategies applied during retraining.
To address this issue, we examined the drop strategies generated with different approaches during the initial training process. We show the detailed drop curves in the appendix and the ablation experiments in Tab. 5. In the table, (first, last, ave) denote, respectively, the strategy generated during the first update of each epoch, the strategy generated during the last update of each epoch, and the average of the strategies within that epoch in the initial training. We conducted ablation experiments on these three strategies by applying the (first, last, ave) strategy during each epoch of the retraining process. The results demonstrate that the model primarily responds to the trend of the regularization level rather than to specific values. We therefore use the last drop policy of each epoch in retraining.
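The three acquisition strategies can be illustrated with a minimal sketch. It assumes that the drop value produced at every update is logged per epoch; the function names and the example values are hypothetical and serve only to show how one drop value per epoch is derived for retraining.

```python
from statistics import mean


def summarize_epoch(drop_values):
    """Reduce the drop values logged during one epoch to (first, last, ave)."""
    if not drop_values:
        raise ValueError("an epoch must contain at least one update")
    return {
        "first": drop_values[0],    # value after the first update of the epoch
        "last": drop_values[-1],    # value after the last update of the epoch
        "ave": mean(drop_values),   # average over all updates of the epoch
    }


def build_retraining_schedule(history, policy="last"):
    """Turn a per-epoch log of drop values into one drop value per epoch."""
    return [summarize_epoch(epoch_values)[policy] for epoch_values in history]


# Hypothetical log: three epochs with three updates each.
history = [
    [0.10, 0.12, 0.11],
    [0.11, 0.09, 0.10],
    [0.08, 0.07, 0.06],
]
schedule = build_retraining_schedule(history, policy="last")
print(schedule)
```

Whichever policy is chosen, the retraining schedule has exactly one entry per epoch, which removes the mismatch between the number of sampled strategies and the number of strategies applied during retraining.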
In addition, the Gaussian Distribution based Data Augmentation Sampler introduces noise into the drop value and prevents the updates of drop from falling into the Matthew effect. Tab. 5 compares training with and without the sampler. The results indicate that not using a sampler leads to a noticeable decline in model performance.

Effect of the number of image divisions and retraining. The accuracy of evaluating model convergence is directly affected by the number of image divisions employed. Over time, the evaluation of convergence affects the optimization direction of the drop. The divided training set and validation set should support the training process while evaluating the degree of convergence of the model as accurately as possible.
As a result, we split a validation set from the training set with ratios of 0.1, 0.2 and 0.4 for evaluation on VTAB-1k, and the results are shown in Tab. 6. In addition, we report the experimental results without retraining, marked as LoRA-DD without retraining. The experiments show that a split ratio of 0.1 is sufficient to evaluate the convergence state of the model. Retraining further improves the effectiveness of the DynamicAug method over the first training.

Comparison on the manually scheduled drop and DynamicAug. Motivated by the successful application of manually designed learning rate schedules in deep learning, such as cosine and linear learning rates, we similarly designed a drop ratio schedule to verify whether a manually designed dynamic drop is feasible. We set the minimum and maximum values of drop to 0 and 0.1, respectively, and increased the drop value using linear and cosine methods as the number of epochs increased. The term "reversal" here means that the initial value of drop is set to 0.1 and it decreases as the number of epochs increases. As shown in Tab. 7, due to the unique characteristics of the drop parameter, it is challenging for it to exhibit behavior similar to that of the learning rate. Therefore, manually setting a dynamic drop schedule can lead to a significant decrease in model performance. This also highlights the importance of DynamicAug.

Data augmentation is a crucial component in most deep learning tasks, serving as a regularization method during model training to mitigate overfitting on local features and focus on global features instead. To better evaluate the impact of the DynamicAug method on transfer learning tasks, we recorded the drop values for the three VTAB-1k dataset groups, Natural (7), Specialized (4) and Structured (8), and plotted the results in the appendix. It is evident that different datasets require distinct optimization strategies. Additionally, we provide the loss difference figures for both the fixed regularization strategy and the dynamic regularization strategy acquired by DynamicAug. From Fig. 4, it is evident that the DynamicAug method significantly suppresses overfitting. Finally, we evaluated the generalizability of the DynamicAug method by incorporating it into the CLIP vision-language model in the appendix.

Conclusion
In this study, we primarily investigate the impact of the dynamic data augmentation strategy on transfer learning tasks. Specifically, we propose a novel model-aware DynamicAug strategy that continuously adjusts the intensity of data augmentation based on the model convergence state. It is worth mentioning that our approach is not restricted to the LoRA, Adapter and VPT fine-tuning methods that we have modified: any model that uses a data augmentation method can be improved with DynamicAug.
However, the convergence curves of different models on diverse datasets can vary significantly. As a result, it is challenging for our method to provide a universal drop strategy that applies to all models and datasets. Additionally, due to the evaluation requirements, we divided the dataset, resulting in a lack of training samples in the first fine-tuning stage. Consequently, after obtaining the drop strategy, the model needs to be retrained.

A.1 Effect of DynamicAug on Large model
To demonstrate the effectiveness of the DynamicAug method on larger models, we performed additional experiments on the ViT-L/16 backbone. The results are shown in Tab. A1. They suggest that the effect of the DynamicAug strategy in preventing overfitting is more pronounced in larger models.

A.3 Detailed experiment results on VTAB-1k
We present the main results on VTAB-1k, including the drop strategy for the individual datasets and per-task results in Fig. A2 and Tab. A2. The experimental results show that after optimization with DynamicAug, the Adapter, Prompt-deep, and LoRA methods all significantly outperformed the original methods, even demonstrating significant improvements on every dataset. In addition, we also provide the results

A.5 Experiments on CLIP-Adapter
We provide the results of DynamicAug with CLIP models. The experimental setting follows CLIP-Adapter. Since the original paper did not include experimental results using ViT-B as the visual module, we reproduced the baseline ourselves. The final results are shown in Fig. A3, which shows improvement across all 11 datasets. It is worth noting that since the datasets involved in CLIP-Adapter already have validation sets, we do not need to partition additional samples from the training set. This avoids the extra work of retraining.
As depicted in Fig. A3, after applying DynamicAug to CLIP-Adapter, the average accuracy of CLIP-Adapter-DD improved significantly compared to using static drop with a drop ratio of 0.1. Even when the drop ratio is set to 0, we are still able to achieve results comparable to it. This indicates that the CLIP model itself fits the data very well, but DynamicAug is still able to outperform it in part of the experiments.

Fig. 1
In contrast to conventional static and manually curated data augmentation strategies, DynamicAug leverages the model itself to explore dynamic data augmentation strategies. We applied three conventional data augmentation methods to LoRA respectively and transformed them into dynamic paradigms. The experimental results are presented on the right side of the figure. LoRA-DA and LoRA represent the experimental outcomes obtained by applying DynamicAug and the static data augmentation method, respectively. Each of these results represents the average outcome derived from three separate experiments involving Mixup-based method, Randaugment, or Drop. The experimental results clearly indicate that dynamic data augmentation methods outperform traditional static methods by a significant margin.

Fig. 3
The experimental results of few-shot transfer learning on five fine-grained visual recognition datasets after applying DynamicAug to LoRA.

Fig. 4
Results of loss difference for both fixed regularization strategy and dynamic regularization strategy acquired by DynamicAug on VTAB-1k. LoRA-DD refers to applying DynamicAug method to LoRA.

Fig. A1
Drop strategies obtained through different strategy acquisition methods for retraining. (first, last, ave) means the first, last and average drop value updated in each epoch.

Fig. A2
We show the LoRA-DD's drop strategy for the individual datasets on VTAB-1k. Due to the particularity of the ViT structure, we use DropPath in DynamicAug method.

Table 4
Ablation experiments for the effectiveness of DynamicAug strategy. LoRA-(0, 0.1, 0.2) refer to training LoRA with the drop rate of 0, 0.1 and 0.2, respectively. LoRA-DD refers to applying DynamicAug method to LoRA.

Table 5
Ablation study on drop strategies obtained through different strategy acquisition methods. LoRA-DD-(first, last, ave) refer to retraining LoRA-DD with the first, last and average drop policy in each epoch. LoRA-DD-without DAS is trained without the Data Augmentation Sampler.
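To make the difference between training with and without the Data Augmentation Sampler concrete, the following is a minimal sketch of a Gaussian sampler around a given expectation. The standard deviation, the clipping to [0, 1], and all function names are assumptions made for illustration; the update of the learnable expectation through the convergence loss is not shown here.

```python
import random


def sample_intensity(expectation, std=0.05, low=0.0, high=1.0, rng=random):
    """Draw one intensity from N(expectation, std^2), clipped to [low, high].

    std and the clipping range are illustrative assumptions.
    """
    value = rng.gauss(expectation, std)
    return min(max(value, low), high)


def intensity_without_sampler(expectation, low=0.0, high=1.0):
    """Ablation variant: use the expectation directly, with no added noise."""
    return min(max(expectation, low), high)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_intensity(0.1, std=0.05, rng=rng) for _ in range(1000)]
print(min(samples), max(samples), sum(samples) / len(samples))
```

With the sampler, the intensity used in each forward pass fluctuates around the current expectation; without it, every forward pass uses the expectation itself.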

Table 6
Ablation study on the effect of image divisions and retraining. LoRA-DD-(0.1, 0.2, 0.4) refer to training LoRA-DD with the image division of ratio 0.1, 0.2, 0.4, respectively, for evaluating the convergence of model training.
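The image division studied in this ablation can be sketched as a random hold-out split of the training indices. The helper name, the seed, and the dataset size of 1000 samples are illustrative assumptions; only the ratios 0.1, 0.2 and 0.4 come from the ablation.

```python
import random


def split_train_val(num_samples, val_ratio, seed=0):
    """Hold out a fraction of the training indices for convergence evaluation."""
    if not 0.0 < val_ratio < 1.0:
        raise ValueError("val_ratio must lie strictly between 0 and 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    num_val = round(num_samples * val_ratio)  # size of the held-out part
    return indices[num_val:], indices[:num_val]


# The three division ratios compared in the ablation, on a hypothetical
# dataset of 1000 training samples.
for ratio in (0.1, 0.2, 0.4):
    train_idx, val_idx = split_train_val(1000, ratio)
    print(ratio, len(train_idx), len(val_idx))
```

A larger ratio gives a more reliable convergence estimate but leaves fewer samples for the first training stage, which is why retraining on the full dataset follows.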

Table 7
Comparisons between the manually designed drop ratio with DynamicAug. LoRA-(linear, cos) refer to applying linear or cos incremental drop schedule while fine-tuning. The term "reversal" here means that the drop decreases as the number of epochs increases.
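The manually designed schedules can be sketched as follows, using the minimum and maximum drop values of 0 and 0.1 from the ablation. The exact functional forms are assumptions: the linear schedule interpolates over epochs, and the cosine schedule uses a half-cosine ramp by analogy with cosine learning rate schedules. The function names are hypothetical.

```python
import math


def linear_drop(epoch, num_epochs, min_drop=0.0, max_drop=0.1, reversal=False):
    """Linearly increase drop from min_drop to max_drop over training."""
    progress = epoch / max(num_epochs - 1, 1)  # 0 at first epoch, 1 at last
    if reversal:
        progress = 1.0 - progress              # start at max_drop and decrease
    return min_drop + (max_drop - min_drop) * progress


def cosine_drop(epoch, num_epochs, min_drop=0.0, max_drop=0.1, reversal=False):
    """Increase drop along a half-cosine ramp (assumed form)."""
    progress = epoch / max(num_epochs - 1, 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * progress))  # smooth ramp from 0 to 1
    if reversal:
        ramp = 1.0 - ramp
    return min_drop + (max_drop - min_drop) * ramp


num_epochs = 100
for epoch in (0, 25, 50, 75, 99):
    print(epoch,
          round(linear_drop(epoch, num_epochs), 4),
          round(cosine_drop(epoch, num_epochs), 4),
          round(linear_drop(epoch, num_epochs, reversal=True), 4))
```

Unlike DynamicAug, these schedules depend only on the epoch index and ignore the convergence state of the model.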
4.4 Impacts of DynamicAug on Transfer Learning

Table A1
Comparisons on VTAB-1k benchmarks using supervised pre-trained ViT-L/16 backbone pre-trained on ImageNet-21k. LoRA-DD refers to applying DynamicAug method to LoRA.

We analyzed the drop strategies generated using various approaches during the initial training process with the Gaussian Distribution based Data Augmentation Sampler. We recorded the results of the experiments, capturing the strategies generated during the first update, the last update in each epoch, and the average of the strategies within that epoch in the initial training process. As illustrated in Fig. A1, different acquisition strategies yield similar drop values in each epoch. This aligns with our expectation of prioritizing regularization trends over specific values. Subsequently, we retrained with the last updated drop strategy in each epoch.