1 Introduction

Transfer learning has gained significant traction in the field of computer vision for a multitude of visual tasks. These tasks encompass image classification [1, 2], image segmentation [3, 4], and object detection [5, 6]. By leveraging transfer learning, practitioners can capitalize on pre-existing models that have been trained on large-scale datasets [5, 7,8,9]. This approach allows for the extraction of valuable features and knowledge, which can subsequently be transferred or fine-tuned to tackle specific tasks with limited available data. Consequently, transfer learning has emerged as a powerful technique, enabling enhanced performance and efficiency in various computer vision applications.

Fine-tuning pre-trained models has proven to be resource-efficient and can yield commendable performance. However, fine-tuning the entire model is often impractical [10]: it demands substantial computational resources and time, which can be prohibitive in real-world scenarios. To address this challenge, researchers have explored strategies such as partial fine-tuning [10,11,12,13] and selective layer freezing. These techniques update only specific layers or parts of the model while keeping the other components frozen, which significantly reduces the computational burden. They strike a balance between leveraging pre-trained knowledge and adapting the model to the target task, offering a practical compromise that uses resources efficiently while still achieving satisfactory performance in various computer vision tasks.

Nevertheless, the datasets employed in transfer learning often exhibit limitations, particularly in terms of their size or scope [14]. Consequently, models fine-tuned on such datasets are susceptible to either overfitting or undertraining. Several mitigation strategies have been proposed, among which data augmentation has emerged as a particularly effective technique. Data augmentation involves applying a variety of transformations to the input images during model training to increase the diversity of the training data. By exposing the model to a broader range of underlying data, data augmentation enhances the model’s capacity for generalization and improves its ability to handle previously unseen inputs with greater robustness. Commonly employed techniques for data augmentation include mixup-based methods [15, 16], RandAugment [17], and others. It is worth noting that the Drop method [18] can also be utilized as a hidden technique for data augmentation in order to combat overfitting.

Current fine-tuning paradigms require the strength of these data augmentation hyperparameters to be set manually before training [13, 19]. However, the fine-tuning process is highly sensitive to them. To achieve optimal fine-tuning performance, a considerable number of augmentation hyperparameters must be searched (e.g., the drop ratio in Drop and the mixing ratio in mixup), and manual tuning is inefficient and resource-intensive. Additionally, hyperparameters determined by manual search are typically static: they do not change during the fine-tuning process. The effectiveness of static hyperparameters for model fine-tuning is debatable, given that dynamic learning rates have been shown to be a superior approach.

Fig. 1 In contrast to conventional static and manually curated data augmentation strategies, DynamicAug leverages the model itself to explore dynamic data augmentation strategies. We applied three conventional data augmentation methods to LoRA and transformed each of them into a dynamic paradigm. The experimental results are presented on the right side of the figure. LoRA-DA and LoRA denote the results obtained with DynamicAug and with the static data augmentation method, respectively. Each result is the average over three separate experiments involving the mixup-based method, RandAugment, or Drop. The results indicate that dynamic data augmentation methods outperform traditional static methods by a significant margin.

To make model fine-tuning more thorough and to mitigate overfitting, we investigate the impact of dynamic augmentation in fine-tuning tasks and propose a dynamic, model-aware augmentation strategy named DynamicAug, which adaptively adjusts the data augmentation intensity based on the model’s convergence state. Unlike traditional static augmentation techniques, dynamic augmentation methods adjust the augmentation strategy during training based on real-time feedback and the current state of the model. The intuition behind DynamicAug originates from the observation that the model exhibits varying levels of convergence at different training stages. Factors such as the dataset, the model, and the training strategy can significantly influence the convergence states a model passes through during training, and each degree of convergence calls for a different level of regularization. Hence, it is essential to adjust the regularization level dynamically based on the model’s convergence state.

To achieve this, we implement a criterion that adjusts the augmentation strategy dynamically during training. It relies on a novel loss function that evaluates the convergence state of the model, and it keeps track of the adjusted augmentation intensity for later retraining. In addition, to preserve the exploratory nature of the learnable augmentation strategy, all hyperparameters required for data augmentation are sampled from a Gaussian Distribution based Data Augmentation Sampler (DAS), which we use to obtain the dynamic data augmentation strategies. The entire training process is end-to-end and incurs no additional computational overhead.

For the specific data augmentation strategies, we primarily focus on dynamic versions of commonly employed static data augmentation techniques: mixup-based methods, RandAugment, and Drop. The intensity of each method is controlled by a specific parameter. For the mixup-based methods, the mixing ratio regulates the intensity. For RandAugment, the magnitude controls the intensity. For Drop, the drop ratio in each Transformer block regulates the intensity. By dynamically modifying the augmentation intensity, these methods help address overfitting and limited training data, thereby facilitating model training.

We evaluate DynamicAug on 27 classification datasets [20,21,22,23,24] in total and conduct extensive experiments on transfer learning tasks, covering different pre-trained weights, different augmentation methods, and different transfer learning models. In addition, DynamicAug requires few additional parameters and is easily extended to various model families. Figure 1 illustrates the gap between DynamicAug and fixed data augmentation methods. Furthermore, we present extensive experimental results to assess its overall effectiveness. Notably, once optimized through dynamic data augmentation, the traditional fine-tuning approach can outperform the current state-of-the-art (SOTA) fine-tuning method SPT [25]. In summary, our contributions are as follows:

1. We propose a dynamic model-aware augmentation approach. This approach can dynamically and adaptively adjust the data augmentation intensity based on the model’s convergence state. The model achieved through dynamic data augmentation fine-tuning demonstrates superior performance compared to the model obtained through static data augmentation fine-tuning.

2. The proposed DynamicAug method is not restricted to the modified LoRA [26], Adapter [27] and VPT [10] methods. It can be seamlessly combined with the mainstream fine-tuning methods.

3. Our experiments validate that DynamicAug is a valuable supplement to current fine-tuning strategies and significantly improves model performance. For instance, DynamicAug improves the LoRA fine-tuning method by achieving an average accuracy increase of +1.7% on VTAB-1k, which even surpasses that of the best fine-tuning architecture searched in NOAH [13] and SPT [25].

2 Related Work

2.1 Data Augmentation

Data augmentation is a technique that artificially generates additional data while preserving the original data distribution within the training set [28]. Commonly employed techniques for data augmentation include mixup-based methods [15, 16], RandAugment [17], among others. Moreover, Drop method [18, 29] can also be regarded as an effective data enhancement strategy.

Drop: Methods like Drop achieve improved model performance or robustness by randomly deactivating parts of the model’s structure during training. For instance, in the dropout method [18], certain neurons stop working with a specific probability during the forward propagation process in the fully connected layer. This process significantly improves the model’s generalization and reduces its sensitivity to local features. Likewise, in the drop path method [29], random paths in the multi-branch network structure are deactivated, removing the dependence of weight update on the joint action of the fixed relationship between the various branches. Typically, these methods are included during model training as a means of regularization to prevent the model from overfitting.

RandAugment: RandAugment [17] is an improvement upon AutoAugment [30]: it holds the probabilities of all image processing operations fixed and varies only the number and intensity of the operations applied. The operations comprise 14 different transformations, such as identity and autocontrast. RandAugment has three main hyperparameters: the magnitude m, the standard deviation of the magnitude, and the number of transformations n. Increasing m and n strengthens the data augmentation, while the standard deviation of the magnitude is often set to 0.5.

CutMix and Mixup: Mixup [15] and CutMix [16] are data augmentation methods that fuse different images to generate new training samples. Mixup randomly selects two samples and computes their linear weighted sum. CutMix instead cuts a region from one image and pastes it onto another, so the images are merged in a region-based manner. New samples generated with Mixup and CutMix help compensate for a shortage of training images and expand the training data to some extent. Given two input images, \(x_A\) and \(x_B\), with their corresponding labels \(y_A\) and \(y_B\), Mixup generates a new mixed image \({\hat{x}}\) as a linear combination of the pixel values of A and B. This can be described as

$$\begin{aligned} {\hat{x}} = \lambda x_A + (1-\lambda ) x_B, \quad {\hat{y}} = \lambda y_A + (1-\lambda ) y_B. \end{aligned}$$
(1)

where \(\lambda \) follows a Beta distribution with parameter 0.8. For CutMix, the process can be formulated as

$$\begin{aligned} {\hat{x}} = M x_A + (1-M) x_B, \quad {\hat{y}} = \lambda y_A + (1-\lambda ) y_B, \end{aligned}$$
(2)

where \(M \in \{0,1\}^{W \times H}\) represents a binary mask associated with \(\lambda \), and W and H denote the width and height of the input images, respectively. \(\lambda \) here obeys the Beta distribution with the parameter 1.0.
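Eqs. 1 and 2 can be illustrated with a minimal sketch on single-channel images stored as nested lists. Several details here are assumptions rather than the authors' implementation: the Beta distribution is taken to be symmetric, the CutMix box is sized so that it covers roughly \(1-\lambda \) of the image, and \(\lambda \) is recomputed from the actual box area. The function names `mixup` and `cutmix` are illustrative.

```python
import random

def mixup(x_a, x_b, y_a, y_b, lam):
    """Eq. (1): blend two H x W images and their one-hot labels with ratio lam."""
    x_hat = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(ra, rb)]
             for ra, rb in zip(x_a, x_b)]
    y_hat = [lam * ya + (1.0 - lam) * yb for ya, yb in zip(y_a, y_b)]
    return x_hat, y_hat

def cutmix(x_a, x_b, y_a, y_b, lam, rng=random):
    """Eq. (2): keep x_a where M = 1 and paste x_b inside a box where M = 0."""
    h, w = len(x_a), len(x_a[0])
    # Assumption: box side lengths scale with sqrt(1 - lam).
    cut_h = int(h * (1.0 - lam) ** 0.5)
    cut_w = int(w * (1.0 - lam) ** 0.5)
    top = rng.randint(0, h - cut_h)
    left = rng.randint(0, w - cut_w)
    x_hat = [row[:] for row in x_a]
    for i in range(top, top + cut_h):
        for j in range(left, left + cut_w):
            x_hat[i][j] = x_b[i][j]
    # Assumption: recompute lam from the real box area so labels match pixels.
    lam_eff = 1.0 - (cut_h * cut_w) / (h * w)
    y_hat = [lam_eff * ya + (1.0 - lam_eff) * yb for ya, yb in zip(y_a, y_b)]
    return x_hat, y_hat, lam_eff

# Draw the mixing ratio; a symmetric Beta(0.8, 0.8) is assumed here.
lam = random.betavariate(0.8, 0.8)
```

A smaller \(\lambda \) hands more of the image (and of the label) to the second sample, which is why \(\lambda \) acts as the intensity knob for both methods.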

These techniques have proven to be highly effective at incorporating richer variations in the training data, while simultaneously preserving local spatial information. Overall, data augmentation aims to enable the model to learn the characteristics of the data in a more comprehensive, multi-faceted manner instead of relying on one-sided information.

2.2 Transfer Learning

Pre-training followed by fine-tuning is the most frequently utilized transfer learning approach [10, 26, 27, 31,32,33], and it has consistently demonstrated remarkable empirical performance across various computer vision tasks. While fine-tuning pre-trained models offers notable efficiency and performance advantages, fine-tuning the entire model remains impractical in many scenarios. Hence, an alternative approach replaces full fine-tuning with the selective tuning of only a few trainable parameters, while keeping the majority of parameters, which are shared across multiple tasks, frozen. This approach typically fine-tunes less than 1% of the parameters of the complete model, yet its overall performance can match or surpass that of full fine-tuning. Because it significantly reduces the storage burden and makes optimization less challenging, tuning fewer parameters when adapting pre-trained models to target datasets becomes more feasible. The current mainstream parameter-efficient tuning methods can be broadly categorized into two groups, namely addition-based parameter tuning methods [10, 27, 31] and reparameterization-based tuning methods [34,35,36].

In the addition-based parameter-efficient tuning approach, extra trainable parameters are appended to the backbone model, and only these supplementary parameters are fine-tuned during adaptation. The existing parameters of the backbone model remain fixed throughout. Two notable examples of addition-based methods are the Visual Prompt Tuning (VPT) algorithm [10] and the Adapter algorithm [27]. VPT places a group of learnable tokens at the beginning of the input sequence of each Transformer block. The input of VPT can be represented by \(x_{vpt}=[\text {x}_{cls},\text {P}]\), where \(\text {x}_{cls}\) represents the class token, P represents the m added learnable tokens with \(\text {P} = \{ \text {p}_k | \text {p}_k \in {\mathbb {R}}^D, k = 1,\ldots , m \}\), and D is the dimension of each token. Adapter is a lightweight network placed after the MLP layers in the Transformer, and it can be formulated as

$$\begin{aligned} x = Linear_{up}(QuickGELU(Linear_{down}(x))), \end{aligned}$$
(3)

where \(Linear_{up}\) and \(Linear_{down}\) are constructed with a small number of parameters, alongside QuickGELU which functions as the activation function.
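As a small illustration of Eq. 3, the sketch below applies a down-projection, QuickGELU, and an up-projection to a single token. QuickGELU is taken to be \(v \cdot \text {sigmoid}(1.702 v)\); the bias terms, the weight layout, and the helper names `linear` and `adapter` are assumptions made for this example, and the code mirrors the equation literally without any residual connection.

```python
import math

def quick_gelu(v):
    """QuickGELU: v * sigmoid(1.702 * v). Not hardened for very large negative v."""
    return v / (1.0 + math.exp(-1.702 * v))

def linear(x, weight, bias):
    """y = x W + b for one token x (length d_in) and W of shape d_in x d_out."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

def adapter(x, w_down, b_down, w_up, b_up):
    """Eq. (3): Linear_up(QuickGELU(Linear_down(x)))."""
    hidden = [quick_gelu(h) for h in linear(x, w_down, b_down)]
    return linear(hidden, w_up, b_up)

# Token of dimension D = 2 squeezed through a bottleneck of width 1.
out = adapter([1.0, 2.0], [[1.0], [1.0]], [0.0], [[1.0, 2.0]], [0.0, 0.0])
```

The bottleneck width is what keeps the number of trainable parameters small relative to the frozen backbone.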

The reparameterization-based methods adjust parameters that are either inherent to the backbone or can be reparameterized into it at inference time. For example, LoRA [26] adds an additional parameter \(\Delta W\) to Q and K in each Transformer block, where \(\Delta W\) is composed of a matrix \(A_{q/k} \in {\mathbb {R}}^{r \times D}\) for up-projection and a matrix \(B_{q/k} \in {\mathbb {R}}^{D \times r}\) for down-projection. Here r is the down-projection dimension (generally set to 8 or 16). During training, only A and B are updated, while the other parameters of the model remain frozen. The forward process of Q and K can be formulated as

$$\begin{aligned} Q = x W_q + x B_q A_q, \quad K = x W_k + x B_k A_k. \end{aligned}$$
(4)
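A minimal sketch of Eq. 4 follows, using nested lists as matrices so that it runs without dependencies. The helper names (`matmul`, `add`, `lora_projection`, `merge`) are illustrative and not taken from any library; `merge` shows, under the shapes stated above, how the low-rank term can be folded into the frozen weight for inference.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_projection(x, w, b_down, a_up):
    """Eq. (4): x W + x B A, where W is frozen and only B and A are trained."""
    return add(matmul(x, w), matmul(matmul(x, b_down), a_up))

def merge(w, b_down, a_up):
    """Fold the low-rank update into W: W' = W + B A."""
    return add(w, matmul(b_down, a_up))

# One token with D = 2 and rank r = 1.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight, D x D
b_down = [[1.0], [0.0]]        # B, D x r
a_up = [[0.0, 3.0]]            # A, r x D
q = lora_projection(x, w, b_down, a_up)
```

Because \(x W + x B A = x (W + B A)\), the merged weight gives the same output with a single matrix product.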

3 Approach

Fig. 2 Overview of our DynamicAug method. Building on existing transfer learning methods, we dynamically update the intensity of data augmentation during fine-tuning. We first replace the conventional data augmentation methods with dynamic paradigms and incorporate additional parameters to update the intensity of data augmentation. Thus, \(\alpha \) is a learnable hyperparameter embedded in the model. To maintain the exploratory nature of \(\alpha \), we introduce the Gaussian Distribution based Data Augmentation Sampler (DAS) and use the augmentation policies sampled from DAS for fine-tuning. Finally, we evaluate the convergence state of the model with a newly introduced loss, the convergence loss in Eq. 9, and then update \(\alpha \).

In this section, we will begin by introducing DynamicAug, a dynamic optimization method that extends the traditional augmentation approach with minimal additional parameters. We show the overview of our method in Fig. 2. To preserve the exploratory nature of the learnable hyperparameters, we have incorporated a Gaussian Distribution based Data Augmentation Sampler for sampling. We leverage this sampler to sample an augmentation strategy, which is then used for fine-tuning purposes. Finally, to update the augmentation strategy based on a reasonable evaluation of the model’s convergence status, we propose a novel loss function referred to as the convergence loss.

3.1 Dynamic Data Augmentation

Models can exhibit distinct convergence states and therefore require appropriate training strategies. Fixed strategies used in previous studies may lead to insufficient training or overfitting under certain conditions [26, 27]. However, a suitable strategy can also aid in the training process. For instance, studies have demonstrated that applying the drop method only during the initial stages of training can improve model fitting to data compared to using it throughout the entire process in certain scenarios [37]. To comprehensively tackle this issue, we introduce the adaptive data augmentation method called DynamicAug. The purpose of DynamicAug is to determine the intensity of data augmentation during fine-tuning.

For Drop. The forward pass of each Transformer block under the traditional drop method can be expressed as follows:

$$\begin{aligned} y_{i} = DropPath(Block_{i}(x_{i}), r_{i}), \end{aligned}$$
(5)

where \(x_i\) and \(y_i\) represent the input and output of the i-th ViT block. The drop ratio of the corresponding block is denoted by \(r_i\), which increases or decreases with the depth i, as in [38]. In DynamicAug, by contrast, we make the drop ratio of each Transformer block a parameter and update it during training. Eq. 5 can then be rewritten as:

$$\begin{aligned} y_{i} = DynamicDrop(Block_{i}(x_{i}), p^i_{d}), \end{aligned}$$
(6)

Here \(p^i_{d}\) is a dynamic drop ratio that is updated as the model improves. We add a learnable hyperparameter \(\alpha ^i_{d}\) for each \(p^i_{d}\). During fine-tuning, we sample \(p^i_{d}\) from a truncated normal sampler parameterized by \(\alpha ^i_{d}\) before each iteration, and continually update \(\alpha ^i_{d}\) according to the model’s convergence state.
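To make the role of the per-block drop ratio concrete, the sketch below applies drop path with a separate ratio for every block. It assumes the common stochastic-depth formulation (the branch is zeroed with probability p and survivors are rescaled by \(1/(1-p)\)) and that each block acts as a residual branch; the text does not spell out either detail. The names `drop_path` and `forward_blocks` are illustrative.

```python
import random

def drop_path(branch_out, p, training=True, rng=random):
    """Zero the branch with probability p; otherwise rescale by 1 / (1 - p)."""
    if not training or p <= 0.0:
        return list(branch_out)
    if rng.random() < p:
        return [0.0 for _ in branch_out]
    keep = 1.0 - p
    return [v / keep for v in branch_out]

def forward_blocks(x, blocks, drop_ratios, training=True, rng=random):
    """Run residual blocks, block i using its own drop ratio drop_ratios[i]."""
    for block, p in zip(blocks, drop_ratios):
        branch = drop_path(block(x), p, training, rng)
        # Assumption: the block output is added back to its input.
        x = [xi + bi for xi, bi in zip(x, branch)]
    return x
```

In a dynamic setting the list `drop_ratios` would be re-sampled before each iteration rather than fixed in advance; at evaluation time, `training=False` disables dropping entirely.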

For Mixup and CutMix. The conventional mixup-based methods mix the pixel values and labels of two images using a linear ratio \(\lambda \). The prevalent training paradigm alternates between Mixup and CutMix, switching between them with a default probability of 0.5. Regardless of which of the two is applied, the parameter \(\lambda \) governs the intensity of data augmentation. Consequently, during the fine-tuning process, we dynamically adjust the value of \(\lambda \) with \([p_{m},p_{c}]\). To learn its optimal value based on the convergence state, we add two parameters \([\alpha _{m},\alpha _{c}]\) for \([p_{m},p_{c}]\).

For RandAugment. In RandAugment, transformation operations are randomly selected from a predetermined set. These operations are then applied to the samples in accordance with preset hyperparameters as

$$\begin{aligned} x = f_{p_1,m}(\ldots (f_{p_n,m}(x))\ldots ), \end{aligned}$$
(7)

where m denotes the magnitude of the transformation and n signifies the number of transformations applied. p represents the specific transform operation.

In DynamicAug, we dynamically update the value of m with \(p_r\) and sample the \(p_r\) with \(\alpha _r\) to control the intensity of data augmentation. Hence, RandAugment requires only one parameter to achieve fine-tuning dynamics.

3.2 Gaussian Distribution Based Data Augmentation Sampler

Algorithm 1 Updating \(\alpha \)

The Gaussian Distribution based Data Augmentation Sampler (DAS) serves the purpose of introducing noise and enhancing the exploratory nature of the data augmentation hyperparameters. Our preference is for the model to prioritize capturing the trend of changes in augmentation strategies rather than specific hyperparameter values.

In DynamicAug, we therefore randomly sample the data augmentation strategies p in real time from the DAS \(\Psi \), whose expectation \(\alpha \) is learnable, and use them for fine-tuning. The DAS can be formulated as:

$$\begin{aligned} p_i \sim \Psi (\alpha _i, \sigma ^2, a_i, b_i), \end{aligned}$$
(8)

where \(\sigma ^2\) is a fixed variance, and \(a_i\) and \(b_i\) are the left and right bounds of the truncated distribution, respectively. For the strategies p, we have \(p=[p^1_{d},\ldots ,p^{i}_{d}]\) for the dynamic drop strategies in Transformer models of depth i, \(p=[p_{m},p_{c}]\) for the dynamic mixup-based methods, and \(p=[p_r]\) for the dynamic RandAugment method. DynamicAug therefore introduces very few additional parameters. To update \(\alpha \) end to end, we use the straight-through estimator (STE) [39] to incorporate \(\alpha \) into the forward process; details are given in Algorithm 1. Mixup-based methods and RandAugment operate on the samples before they enter the model, whereas Drop acts throughout the model’s forward process, so the ways in which their parameters are updated also differ. For Drop, we incorporate the corresponding parameter \(\alpha _i\) in each block during the forward process. For the other two methods, \(\alpha \) is multiplied in at the end of the model’s forward process to assess the state of the model.
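The sampling step in Eq. 8 can be sketched as a truncated normal draw. The version below uses simple rejection sampling, which is an assumption about how the truncation is realized; the function name `sample_das`, the `max_tries` cap, and the clamped fallback are likewise illustrative choices, not details from the text.

```python
import random

def sample_das(alpha, sigma, a, b, rng=random, max_tries=1000):
    """Draw p from N(alpha, sigma^2) truncated to [a, b] by rejection sampling."""
    for _ in range(max_tries):
        p = rng.gauss(alpha, sigma)
        if a <= p <= b:
            return p
    # Fallback (assumption): if no draw lands in [a, b], clamp the mean instead.
    return min(max(alpha, a), b)

# Example: one drop ratio per block for a 12-block model, each with mean 0.1.
alphas = [0.1] * 12
drop_ratios = [sample_das(al, 0.05, 0.0, 0.3) for al in alphas]
```

In an automatic differentiation framework, one common way to realize a straight-through estimator for such a sample is to treat the noise as a constant, so that the gradient of p with respect to \(\alpha \) is 1; whether Algorithm 1 does exactly this is not stated here.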

To provide further evidence, we conducted ablation studies comparing the use of DAS to not using one. The results demonstrate a significant decrease in model performance when a sampler is not utilized. Moreover, recent studies have also employed samplers to obtain corresponding strategies [40].

Since \(\alpha \) is a hyperparameter determined by the convergence behavior of the model, its optimization process differs from that of the model parameter w. Therefore, we introduced a new loss function specifically for optimizing \(\alpha \).

3.3 Loss for Model Convergence State

Since obtaining the model’s convergence state directly on the evaluation set is still a challenging problem, we have proposed an alternative approach to simulate the convergence state and subsequently update the \(\alpha \) parameter. Our approach centers on monitoring the alteration in model loss and entails reserving a designated number of samples from the training set to construct a validation set that emulates the evaluation data. Specifically, we have introduced a novel loss function, referred to as the Convergence loss, which serves as the optimization objective for \(\alpha \),

$$\begin{aligned} L_{cvg} = \Delta L_{val}(w \mid \alpha ,D_{val}). \end{aligned}$$
(9)

In the proposed loss function, \(L_{val}\) represents the Cross-entropy loss and w represents the common model weight. During the actual training process, we can estimate the change in loss by measuring the difference between iterations. Subsequently, Eq. 9 can be approximated as

$$\begin{aligned} L_{cvg} = L_{val}(w^t \mid \alpha ,D_{val}) - L_{val}(w^{t-f} \mid \alpha ,D_{val}), \end{aligned}$$
(10)

where f is the update frequency of \(\alpha \) and t is the current training iteration.

Generally, a larger negative incremental loss indicates a greater potential for improvement in model performance, suggesting that the model has not yet converged. Therefore, our ultimate optimization goal can be expressed as follows:

$$\begin{aligned} \mathop {\arg \min }\limits _{\alpha } L_{cvg}(\alpha \mid D_{val}, w^{t}, w^{t-f}). \end{aligned}$$
(11)

To update \(\alpha \), we obtain its gradient \(\nabla _{\alpha }\) by storing the gradient of the validation loss with respect to \(\alpha \) at iteration \(t-f\) and computing it again f iterations later, at iteration t:

$$\begin{aligned} \nabla _{\alpha } L_{cvg} = \nabla _{\alpha } L_{val}(w^t \mid \alpha , D_{val}) - \nabla _{\alpha } L_{val}(w^{t-f} \mid \alpha , D_{val}) \end{aligned}$$
(12)

During the update of \(\alpha \), we explicitly set the drop ratio to 0 or turn off the mixup and RandAugment methods in order to assess the model’s performance in the evaluation state. On the held-out validation split, \(\alpha \) takes part in the forward process of each block, and its gradient is maintained through the straight-through estimator (STE) during the backward process. The pseudocode for the \(\alpha \) update is given in Algorithm 1. In the implementation, the augmentation parameter \(\alpha \) is updated only every few iterations, which keeps the computational overhead small.
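The update described by Eqs. 10 and 12 can be sketched as follows, given validation losses and \(\alpha \)-gradients measured at iterations \(t-f\) and t. The plain gradient-descent step, the learning rate `lr`, and the clipping of \(\alpha \) to \([a, b]\) are assumptions made for illustration; the actual procedure is the one in Algorithm 1.

```python
def convergence_loss(val_loss_now, val_loss_prev):
    """Eq. (10): change in validation loss between iterations t - f and t."""
    return val_loss_now - val_loss_prev

def update_alpha(alpha, grad_now, grad_prev, lr, a, b):
    """One descent step on L_cvg with the gradient of Eq. (12), clipped to [a, b]."""
    new_alpha = []
    for al, g_t, g_prev in zip(alpha, grad_now, grad_prev):
        grad_cvg = g_t - g_prev  # Eq. (12)
        new_alpha.append(min(max(al - lr * grad_cvg, a), b))
    return new_alpha

# A negative value means the validation loss is still falling.
l_cvg = convergence_loss(0.8, 1.0)
alpha = update_alpha([0.1, 0.1], [0.5, 0.2], [0.2, 0.4], lr=0.1, a=0.0, b=0.5)
```

Calling `update_alpha` only when the iteration counter is a multiple of f matches the description of updating \(\alpha \) every few iterations.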

It is important to recognize that, while the augmentation parameters may not exert a noticeable impact on the loss during individual training iterations, they do have a subtle influence on the overall training process over an extended period of time.

4 Experiments

In this section, we begin by outlining our experimental setup, which encompasses the datasets, baseline methods, and implementation details employed in our study. We then proceed to showcase the effectiveness of DynamicAug across multiple mainstream transfer learning tasks. Furthermore, we conduct ablation experiments to evaluate the impact and efficacy of DynamicAug in comparison to other approaches. Lastly, we delve into in-depth analyses to enhance our comprehension of the role of augmentations in transfer learning tasks.

4.1 Experimental Settings

Datasets. Our experiments primarily utilize three types of datasets. (1) We employ the VTAB-1k [24] benchmark for transfer learning in visual classification. It includes 19 classification tasks categorized into three domains: (i) natural images captured by standard cameras; (ii) specialized images captured by non-standard cameras, such as remote sensing and medical cameras; (iii) structured images synthesized from simulated environments. The tasks are diverse, including object counting and depth estimation, and span varied image domains. With only 1000 training samples per task, the benchmark is extremely challenging. (2) Another benchmark we utilize is Fine-Grained Visual Classification (FGVC), which focuses on fine-grained visual classification tasks. This benchmark comprises several datasets, including Stanford Dogs [20], Oxford Flowers [21], NABirds [22], CUB-200-2011 [23] and Stanford Cars [41]. Each FGVC dataset consists of 55 to 200 categories and several thousand images for training, validation, and testing. Where validation sets are not provided, we follow the validation split specified in [13]. (3) For the few-shot tasks, we select five fine-grained visual recognition datasets, namely Food101 [42], OxfordFlowers102 [43], StanfordCars [44], OxfordPets [45], and FGVCAircraft [46]. These datasets comprise categories that depict a diverse range of visual concepts closely associated with daily life, such as food, plants, vehicles, and animals. To assess the effectiveness of our approach, we follow previous studies [33, 47] and evaluate the performance on 1, 2, 4, 8, and 16 shots, which is adequate for observing the trend.

Baselines. Following [25], our main experiments are conducted using a Vision Transformer backbone, ViT-B/16, pre-trained on ImageNet-21K. We incorporate the DynamicAug method into LoRA [26], Adapter [27] and Prompt-deep [10] to further enhance their performance. During fine-tuning, these three methods originally use only the Drop data augmentation technique. To ensure a fair comparison, we conduct separate fine-tuning runs for the Mixup and RandAugment methods, thereby establishing baselines specific to these two data augmentation techniques. Furthermore, we perform additional experiments on the ViT-L/16 backbone to demonstrate the effectiveness of DynamicAug on larger models. In terms of training strategies, we explore both supervised and self-supervised pre-training. Specifically, we employ MAE [48] and MoCo v3 [49] to train our models. Finally, to assess the generalizability of DynamicAug, we incorporate it into the CLIP language-visual model [50, 51] in the appendix. This evaluation allows us to examine the applicability of DynamicAug beyond purely visual tasks, extending it to language-visual tasks as well.

Implementation Details. Following [13], we utilize the AdamW optimizer [52] with cosine learning rate decay for our experiments. Specifically, for the VTAB experiments, we set the batch size to 64, the learning rate to \(1\times 10^{-3}\), and the weight decay to \(1\times 10^{-4}\). To ensure fairness, we follow the standard data augmentation pipeline [13]. To update the augmentation parameters, we hold out a fraction of the training samples as validation data, with a ratio such as 0.1, 0.2, or 0.4. The augmentation strategy is updated either every 5 iterations or at the maximum number of iterations that remain after the data are split. To optimize the performance of DynamicAug, we conduct a grid search over the hyperparameters involved. Importantly, DynamicAug operates only during training and is turned off during validation and testing. Therefore, it does not increase the model’s inference time.

Table 1 Comparisons between the traditional data augmentation methods and DynamicAug on VTAB-1k [24] benchmarks using supervised pre-trained ViT-B/16 backbone pre-trained on ImageNet-21k
Table 2 Comparisons on FGVC and VTAB-1k [24] benchmarks using supervised pre-trained ViT-B/16 backbone pre-trained on ImageNet-21k

4.2 Main Results

Experiments on VTAB-1k. We first choose the VTAB-1k [24] benchmark to evaluate the proposed DynamicAug method. Table 1 shows that the baseline models equipped with DynamicAug outperform their counterparts with the original data augmentation methods, for both addition-based and reparameterization-based approaches. The abbreviations FD, FM, and FR correspond to Fixed Drop, Fixed Mixup, and Fixed RandAugment, respectively. Similarly, DD, DM, and DR represent Dynamic Drop, Dynamic Mixup, and Dynamic RandAugment, respectively. For the three fine-tuning methods LoRA [26], Adapter [27], and Prompt-deep [10], the DD (DM, DR) method improves over the FD (FM, FR) method by 1.7% (1.1%, 0.9%), 1.2% (0.7%, 0.6%), and 2.1% (2.7%, 2.8%), respectively.

Remarkably, LoRA-DD outperforms the SOTA methods NOAH [13] and SPT [25] in Table 2 with just 12 additional trainable parameters (one per block, matching the depth of ViT-B/16). This suggests that, for a given model, the training strategy can sometimes affect performance more than the architecture does.

Self-supervised learning strategies constitute a crucial component of deep learning [48, 49, 53]. However, previous efficient fine-tuning methods have shown inconsistent results when applied to backbones pre-trained with different learning strategies. To validate the effectiveness of DynamicAug under different pre-training strategies, we conducted experiments on MAE and MoCo v3 pre-trained backbones. The results are shown in Table 3. LoRA-DD achieves mean top-1 accuracy gains of 1.2% and 2.1% over the baseline method on the VTAB-1k benchmark and obtains state-of-the-art results. Because Drop has a wider scope of application, we mainly focus on the Drop strategy in subsequent experiments.

It should be noted that the 19 VTAB-1k datasets are very small, comprising only 800 to 1000 samples each. Holding out part of a dataset to evaluate the drop policy can therefore considerably hurt the model’s performance. Consequently, after obtaining the drop strategy, we retrain the model on the full dataset.

Table 3 Comparison on the VTAB-1k [24] benchmark using self-supervised ViT-B/16 backbones pre-trained with MAE [48] and MoCo v3 [49]

Experiments on FGVC. We conducted experiments on five fine-grained datasets for FGVC. As shown in Table 2, LoRA-DD outperforms the LoRA baseline by a clear margin of 1.2% in mean top-1 accuracy. The results show that DynamicAug also performs well on fine-grained tasks.

Experiments on Few-Shot Transfer Learning. Owing to the particular nature of the few-shot datasets, we split each dataset with held-out proportions of 0.1, 0.2, and 0.4 for evaluation purposes. As depicted in Fig. 3, applying DynamicAug (blue line) significantly improves the average accuracy of LoRA compared with static drop (orange line). Furthermore, the average results are on par with the current SOTA method, NOAH (green line), and surpass it in the low-data regime of 1, 2, and 4 shots. Notably, LoRA-DD performs especially well on the FGVCAircraft dataset. Although the other datasets do not show the same level of improvement, on most of them LoRA-DD still outperforms LoRA with static drop.

Fig. 3 The experimental results of few-shot transfer learning on five fine-grained visual recognition datasets after applying DynamicAug to LoRA

Table 4 Ablation experiments for the effectiveness of DynamicAug strategy

4.3 Ablation Study

Effect of DynamicAug. To assess the effectiveness of DynamicAug on VTAB-1k, we used several static drop rates as baselines and compared the dynamic drop strategy against them. The findings are presented in Table 4. Relative to the original baseline with a drop rate of 0.1, DynamicAug increases accuracy by 1.5%, 0.9%, and 2.8% on the Natural, Specialized, and Structured groups, respectively.

Although varying the drop rate affects model training to some degree, static drop evidently yields a suboptimal training strategy. DynamicAug can partly compensate for insufficient or excessive training. At the same time, the greedy update strategy can speed up the model’s convergence.

Effect of the Gaussian Distribution based Data Augmentation Sampler. As mentioned previously, after the initial training on the incomplete dataset, the model must be retrained on the complete dataset. However, the number of strategies obtained from the sampler during the initial training does not match the number of strategies applied during retraining.

To address this issue, we examined drop strategies derived in different ways from the initial training process. The detailed drop curves are shown in the appendix and the ablation experiments in Table 5. In the table, (first, last, ave) denote, respectively, the strategy generated at the first update of each epoch, the strategy generated at the last update of each epoch, and the average of the strategies within that epoch during the initial training. We conducted ablation experiments by applying each of these three strategies in every epoch of the retraining process. The results demonstrate that the model is sensitive mainly to the trend of the regularization level rather than to its specific values. We therefore use the last drop policy of each epoch in retraining.
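The three reductions described above can be sketched as follows. This is a minimal illustration, assuming the initial training logs one drop rate per sampler update and groups them by epoch; the function name `reduce_epoch_policies` and the nested-list layout are hypothetical and do not come from the original implementation.

```python
from statistics import fmean


def reduce_epoch_policies(updates_per_epoch, mode="last"):
    """Collapse the drop rates recorded within each epoch into one value per epoch.

    updates_per_epoch: list of non-empty lists; entry e holds the drop rates
    produced by the sampler updates during epoch e (hypothetical log format).
    mode: "first", "last", or "ave", matching the three ablated strategies.
    """
    reducers = {
        "first": lambda rates: rates[0],   # first update of the epoch
        "last": lambda rates: rates[-1],   # last update of the epoch
        "ave": fmean,                      # mean over the epoch
    }
    if mode not in reducers:
        raise ValueError(f"unknown mode: {mode!r}")
    return [reducers[mode](rates) for rates in updates_per_epoch]


if __name__ == "__main__":
    # Toy log: two epochs with three and two sampler updates.
    history = [[0.10, 0.12, 0.08], [0.05, 0.07]]
    for mode in ("first", "last", "ave"):
        print(mode, reduce_epoch_policies(history, mode))
```

The resulting list holds one drop rate per epoch, so it can be replayed directly during retraining regardless of how many updates each epoch of the initial training contained.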

In addition, the Gaussian Distribution based Data Augmentation Sampler introduces noise into the drop values and prevents the drop updates from falling into the Matthew effect. Table 5 also compares results with and without the Gaussian sampler. Omitting the sampler leads to a noticeable decline in model performance.
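To make the role of the sampling noise concrete, the sketch below shows one plausible form of a Gaussian sampler combined with a greedy update. It is an assumption-laden illustration, not the authors' implementation: the names `sample_candidates` and `greedy_update`, the standard deviation, the number of candidates, the clipping bounds, and the choice to keep the current value among the candidates are all hypothetical, and `val_loss_fn` stands in for evaluating the model on the held-out split.

```python
import random


def sample_candidates(current, sigma=0.02, n=4, low=0.0, high=0.5, rng=None):
    """Draw candidate drop rates from a Gaussian centred on the current rate.

    All defaults are illustrative assumptions. The current value is kept as
    the first candidate so that a greedy choice never does worse than it.
    """
    rng = rng or random.Random()
    candidates = [current]
    for _ in range(n):
        value = rng.gauss(current, sigma)
        candidates.append(min(high, max(low, value)))  # clip to [low, high]
    return candidates


def greedy_update(current, val_loss_fn, **sampler_kwargs):
    """Return the candidate with the lowest held-out loss (greedy step)."""
    candidates = sample_candidates(current, **sampler_kwargs)
    return min(candidates, key=val_loss_fn)


if __name__ == "__main__":
    # Toy stand-in for a validation loss with its minimum at a drop rate of 0.2.
    toy_loss = lambda d: (d - 0.2) ** 2
    rng = random.Random(0)
    drop = 0.0
    for _ in range(50):
        drop = greedy_update(drop, toy_loss, rng=rng)
    print(round(drop, 3))
```

Under these assumptions, the noise lets the search leave the current drop rate and explore its neighbourhood, which is one way to read the claim that the sampler keeps the updates from reinforcing a single direction.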

Table 5 Ablation study on drop strategies obtained through different strategy acquisition methods

Effect of the number of image divisions and retraining. The number of image divisions directly affects how accurately model convergence can be evaluated, and the convergence estimate in turn affects the optimization direction of the drop. The split training and validation sets should therefore support the training process while allowing the convergence of the model to be evaluated as accurately as possible.

Accordingly, we split a validation set from the training set with ratios of 0.1, 0.2, and 0.4 for evaluation on VTAB-1k; the results are shown in Table 6. We also report the results without retraining, marked as LoRA-DD-without retraining. The experiments show that a split ratio of 0.1 is sufficient to evaluate the convergence state of the model. Retraining further improves the effectiveness of DynamicAug over the first training.
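As a small illustration of the split described above, the following sketch holds out a validation subset at a given ratio. The helper name `split_train_val`, the index-based representation, and the fixed seed are assumptions made for this example only.

```python
import random


def split_train_val(indices, val_ratio, seed=0):
    """Shuffle sample indices and hold out a fraction for validation.

    Returns (train_indices, val_indices). The seed is fixed only so that
    the example is reproducible; it is not taken from the paper.
    """
    if not 0.0 < val_ratio < 1.0:
        raise ValueError("val_ratio must lie strictly between 0 and 1")
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # A 1000-sample training set with the three ratios from the ablation.
    for ratio in (0.1, 0.2, 0.4):
        train, val = split_train_val(range(1000), ratio)
        print(ratio, len(train), len(val))
```

With 1000 samples, a ratio of 0.1 leaves 900 samples for the first training stage and 100 for estimating convergence, whereas a ratio of 0.4 leaves only 600, which illustrates why a smaller ratio is attractive when it already gives a reliable estimate.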

Table 6 Ablation study on the effect of image divisions and retraining

Comparison of manually scheduled drop and DynamicAug. Motivated by the success of manually designed learning rate schedules in deep learning, such as cosine and linear schedules, we designed analogous drop ratio schedules to test whether a hand-crafted dynamic drop is feasible. We set the minimum and maximum drop values to 0 and 0.1, respectively, and increased the drop value linearly or along a cosine curve as the number of epochs increased. The term “reversal” here means that the drop starts at 0.1 and decreases as the number of epochs increases. As shown in Table 7, owing to the unique characteristics of the drop parameter, it does not behave like the learning rate. Manually setting a dynamic drop schedule can therefore lead to a significant decrease in model performance, which highlights the importance of DynamicAug.
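The manual schedules compared here can be written down compactly. The sketch below is one straightforward reading of the description (drop bounded by 0 and 0.1, linear or cosine growth over epochs, with an optional reversal); the function name `manual_drop_schedule` and the exact cosine parameterization are assumptions, since the text does not give the formula.

```python
import math


def manual_drop_schedule(epoch, total_epochs, kind="linear", reverse=False,
                         drop_min=0.0, drop_max=0.1):
    """Return the drop ratio for a 0-indexed epoch under a hand-made schedule.

    kind: "linear" or "cosine" (half-cosine ramp, an assumed form).
    reverse: if True, start at drop_max and decrease to drop_min ("reversal").
    """
    if total_epochs < 2:
        raise ValueError("total_epochs must be at least 2")
    progress = epoch / (total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    if kind == "linear":
        frac = progress
    elif kind == "cosine":
        frac = 0.5 * (1.0 - math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule kind: {kind!r}")
    if reverse:
        frac = 1.0 - frac
    return drop_min + (drop_max - drop_min) * frac


if __name__ == "__main__":
    # Print the four schedules at the start, middle, and end of 101 epochs.
    for kind in ("linear", "cosine"):
        for reverse in (False, True):
            values = [manual_drop_schedule(e, 101, kind, reverse) for e in (0, 50, 100)]
            print(kind, "reversal" if reverse else "increasing", values)
```

Each schedule depends only on the epoch index, so unlike DynamicAug it cannot react to the convergence state of the model.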

Table 7 Comparison between manually designed drop ratio schedules and DynamicAug
Fig. 4 Results of loss difference for both fixed regularization strategy and dynamic regularization strategy acquired by DynamicAug on VTAB-1k. LoRA-DD refers to applying DynamicAug method to LoRA

4.4 Impacts of DynamicAug on Transfer Learning

Data augmentation is a crucial component of most deep learning tasks, serving as a regularizer during model training that mitigates overfitting to local features and steers the model toward global features. To better evaluate the impact of DynamicAug on transfer learning tasks, we plotted the drop values for the three VTAB-1k dataset groups, Natural (7), Specialized (4), and Structured (8), in the appendix. It is evident that different datasets require distinct optimization strategies. Additionally, we provide loss difference plots for both the fixed regularization strategy and the dynamic regularization strategy acquired by DynamicAug. Fig. 4 shows that, with the static data augmentation strategy, the model overfits between the 30th and 40th epochs, as indicated by a positive loss difference. In contrast, DynamicAug continues to facilitate model training during the same period. Finally, we evaluated the generalizability of DynamicAug by incorporating it into the CLIP vision-language model; see the appendix.

5 Conclusion

In this study, we investigate the impact of a dynamic data augmentation strategy on transfer learning tasks. Specifically, we propose a novel model-aware DynamicAug strategy that continuously adjusts the intensity of data augmentation according to the convergence state of the model. Our approach is not restricted to the LoRA, Adapter, and VPT fine-tuning methods that we modified: any model that involves a data augmentation method can be improved with DynamicAug.

However, the convergence curves of different models on diverse datasets can vary significantly. As a result, it is difficult for our method to provide a universal drop strategy that applies to all models and datasets. Additionally, because evaluation requires a held-out split, we divided the dataset, which leaves fewer training samples in the first fine-tuning stage. Consequently, after obtaining the drop strategy, the model needs to be retrained.