1 Introduction

Few-shot learning (FSL) is inspired by the memory mechanism of the human brain: after learning from a certain number of categories, a deep learning model should be able to recognize new categories quickly from only a few examples, which is the conundrum FSL aims to solve. With the rapid advance of deep learning techniques and sustained research effort, FSL methods have made significant progress, such as MAML [1], ProtoNet [2], RelationNet [3], MatchingNet [4], and GNN [5]. Typically, however, FSL methods are limited to a restrictive setting in which the base classes used for training and the novel classes used for testing come from the same domain (or data distribution). This setup is often impractical, and when the samples in the source and target domains follow different data distributions, existing FSL methods lose much of their performance. This observation motivates cross-domain few-shot learning, which has attracted extensive attention from researchers [6,7,8,9].

Cross-Domain Few-Shot Learning (CD-FSL) develops few-shot learning further by breaking the restrictive assumption that the base classes and the novel classes are sampled from the same domain. In other words, CD-FSL must handle the challenges of FSL and domain adaptation (DA) simultaneously. Both FSL and DA are essentially transfer learning problems: FSL is usually stated as a transfer from base classes to novel classes, while domain adaptation is usually described as a transfer from a source domain to a target domain. A comparison of the similarities and differences between FSL, DA, and CD-FSL is shown in Fig. 1.

The main challenges of CD-FSL are narrowing the domain shift between the two domains and enhancing the generalization ability of the network. At present, there are two major approaches to CD-FSL. The first uses only samples from the source domain, without access to target-domain samples during training, and relies on adversarial training or decoupled learning to extract discriminative feature representations. For example, ATA [10] argues that the inductive bias learned from the base classes is not robust enough and therefore applies adversarial task augmentation to strengthen the cross-domain inductive bias. CHEF [11] adopts feature fusion to unify feature representations at different levels of abstraction into a single representation, enabling useful feature sharing between the source and target domains.

Although the above methods can improve the model’s generalization ability, they can easily lead to suboptimal results because they ignore the adaptation process to the target domain; hence, the improvement on the CD-FSL task remains limited. The second approach assists model training with additional data from the target domain. For instance, NSAE [12] further fine-tunes its model in a self-supervised way using a large amount of unlabeled target samples, while study [13] achieves excellent results with very few labeled target samples. Given that the cost of acquiring a very small amount of labeled data is acceptable in practice, we also follow this direction. Notably, the images in the introduced auxiliary dataset are disjoint from the target test data, participate only in the training process, and never appear at inference time, so this operation does not violate the problem setup of the CD-FSL task.

Fig. 1

Illustrating the differences between few-shot learning, domain adaptation, and cross-domain few-shot learning. In FSL, the samples in the source and target spaces are in the same domain, while their category spaces are non-overlapping, and the amount of available samples per category in the target space is tiny. In DA, the source and target spaces are from different domains, but their category spaces are the same. The CD-FSL task integrates the characteristics of DA based on FSL

To overcome the limitations of previous CD-FSL work, we make improvements in three aspects. First, to make the most of the newly introduced target samples, we use the mixup [14] technique to mix images from the source domain with images from the target domain, which greatly augments the target samples. Second, unlike traditional CD-FSL approaches that adopt feature mapping or augmentation to simulate the change of data distribution in the target domain, we propose a domain-specific adaptor that accommodates the target domain’s data distribution, alleviating the maladaptation that arises when knowledge learned in the source domain is transferred to the target domain and further improving the model’s generalization ability. Finally, since CD-FSL faces multiple challenges, multiple optimization tasks (loss functions) are required to jointly constrain the model toward good generalization. In previous CD-FSL work, however, the losses were simply added, which restricts the convergence direction of the model and degrades its performance. Therefore, we propose an adaptive loss optimization method for network optimization.

Formally, we propose a novel Target Oriented Dynamic Adaption (TODA) framework in this work. Specifically, a source episode and a target episode are randomly sampled from the source and target datasets in each training iteration. Next, the query sets of the two episodes are mixed to form a mixed query set. Then, the feature extraction network with domain-specific adaptors and the decoupling module extract domain-invariant and domain-specific features for the support set of the source episode, the support set of the target episode, and the mixed query set, respectively. Afterward, the few-shot classification task and the domain classification task are executed, where the domain classification task is further divided into a domain discrimination task and a domain confusion task; these three tasks are performed simultaneously. Finally, the three task losses are sent to the adaptive optimization module, which assigns different weight coefficients to different tasks according to their importance, so that the model converges quickly and its performance is further improved.

The main contributions of this work are summarized as follows.

  • We propose a novel Target Oriented Dynamic Adaption (TODA) model. The TODA model achieves good generalization ability in the CD-FSL task by utilizing the target data to guide the model to adapt to the data distribution in the target domain.

  • We design a domain-specific adaptor, which is integrated into the backbone network in the meta-training stage and trained with our training strategy, so that the extracted features become more specific to the target-domain task, tackling the maladaptation of the CD-FSL classification model when transferring to the target domain.

  • We also propose an adaptive optimization module, which adaptively allocates different weights to different tasks according to their importance to the model during optimization. Meanwhile, we evaluate our method on four benchmark datasets, and the experimental results show that TODA is more competitive than current CD-FSL methods.

2 Related Work

2.1 Few-Shot Learning

Few-shot learning aims to quickly build the ability to recognize new concepts from few instances. Through the continuous efforts of researchers, many advances have been made in FSL. The FSL methods are roughly classified into three directions: model-based initialization [1, 15, 16], metric learning [2,3,4,5], and data augmentation [17,18,19]. Recently, DN4 [20] performs classification by computing the similarity between the input samples and the local descriptors of each category, EASE [21] learns a discriminative ability space in an unsupervised way, and TransVLAD [22] proposes a Transformer framework for local aggregation descriptors. One commonality of the above methods is that they acquire training and testing samples from the same data distribution. Research [23] pointed out that when the samples in the source domain and the target domain are sampled from disparate datasets, existing FSL methods cannot achieve excellent performance on new target classes. In this paper, we adopt GNN networks as the fundamental elements of our network and use them in cross-domain scenarios.

2.2 Domain Adaptation

Domain Adaptation (DA) is designed to reduce the gap between two domains, so that the knowledge acquired in the source domain can be applied to the target domain. The target and source domains share the same category space, while their data distributions differ. Domain adaptation methods based on deep learning are broadly split into three types: distribution difference-based methods [24, 25], adversarial-based methods [26], and reconstruction-based methods [27]. In recent years, unsupervised domain adaptation (UDA) has become the mainstream direction in DA research. Traditional UDA methods [28, 29] usually utilize subspace alignment techniques, while many recent UDA methods [30, 31] are inclined toward adversarial learning. Domain adaptation methods cannot be directly used in CD-FSL tasks, since the category spaces of the source and target domains are disjoint in CD-FSL, whereas domain adaptation assumes enough unlabeled or labeled target samples are available. However, CD-FSL and DA share some common challenges, so ideas from DA can be introduced into CD-FSL methods to achieve better network performance. For example, the TSVR method proposed by Lv et al. [9] applies DA to FSL and achieves good results.

2.3 Cross-Domain Few-shot Learning

Prompted by the findings of [23], Tseng et al. [6] formally proposed and defined the cross-domain few-shot learning task. Then, Guo et al. [7] presented a new extensive study along with a new benchmark. CD-FSL methods are divided into two groups based on whether the target dataset can be used in the training phase. For the case where target data is not used, ATA [10] employs an adversarial training strategy to enhance the robustness of the inductive bias, thereby improving the model’s performance in test environments that differ significantly from the training data. In Wave-SAN [32], styles are considered to carry domain information; therefore, styles are swapped between training episodes via self-supervised learning so that the model becomes insensitive to style shifts. Typically, since the target data is not available, the performance of such methods is lower than that of methods using target data. Therefore, some researchers have tried fine-tuning their models by exploiting data in the target domain. For instance, NSAE [12] enhances features with noise, and ConFT [33] applies a contrastive loss; both fine-tune their models using support data in the target domain. Liu et al. [34] and Yao [35] further introduce unlabeled target data into the model training stage. In general, the above methods take considerable time during training because they have to fine-tune the network on each test set or need a large amount of unlabeled target data to guarantee performance. Therefore, in contrast to such time-consuming strategies, we follow the setting in [13] and assume that only a small number of labeled samples from the target domain are accessible. This reduces training time and data requirements while achieving better performance.

3 Methodology

3.1 Problem Definition

Assume a source dataset \(D_{sou}=\left\{ x_{i},y_{i} \right\} _{i=1}^{ns}\) with abundant labeled samples and a target dataset \(D_{tar}=\left\{ x_{j},y_{j} \right\} _{j=1}^{nt}\) with a small number of labeled samples, where \(x_{i}\) denotes the i-th sample and \(y_{i}\) its label. The classes of the two datasets are disjoint. In standard few-shot learning, \(D_{sou}\) and \(D_{tar}\) are obtained from the same domain, whereas in CD-FSL they are sampled from different domains. Under the CD-FSL setting proposed in FWT [6], a source base dataset \(Ds_{b}\), a target base dataset \(Dt_{b}\), and a target novel dataset \(Dt_{n}\) are further obtained; these are three sub-datasets sampled from \(D_{sou}\) and \(D_{tar}\). Specifically, \(Ds_{b}\) is randomly sampled from \(D_{sou}\), and \(Dt_{b}\) and \(Dt_{n}\) are randomly sampled from \(D_{tar}\). It must be emphasized that the categories of these three sub-datasets are disjoint. The number of samples per category in \(Dt_{b}\) is much smaller than in \(Ds_{b}\), and \(Dt_{b}\) is used as an auxiliary dataset (\(D_{aux}\)) during training. In the experiments, our network is trained on \(Ds_{b}\) and \(D_{aux}\) and tested on \(Dt_{n}\).

To keep model training consistent with testing, N-way K-shot classification tasks (episodes) are employed so that the model learns to transfer knowledge well. To be specific, given a dataset, N classes are first randomly drawn from it, and then K labeled samples are randomly extracted from each class to constitute the support set. The query set is then formed by randomly drawing q samples per class from the remaining samples of the N classes, so that the support and query samples are disjoint. In general, \(N=5\), \(K \in \left[ 1,5 \right] \), and \(q=15\sim 20\). An illustrative sketch of this episodic sampling is given below.
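The following sketch builds one N-way K-shot episode from a labeled dataset; the helper name and the list-of-pairs data format are assumptions made purely for illustration, not the authors' code.

```python
# Hypothetical sketch of N-way K-shot episode sampling (illustrative only).
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=5, q_query=15):
    """dataset: iterable of (sample, class_label) pairs covering many classes.
    Returns a support set (n_way * k_shot items) and a query set (n_way * q_query items),
    both relabeled with episode-local class indices 0..n_way-1."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    classes = random.sample(list(by_class.keys()), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        samples = random.sample(by_class[c], k_shot + q_query)
        support += [(x, episode_label) for x in samples[:k_shot]]
        query += [(x, episode_label) for x in samples[k_shot:]]
    return support, query
```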

3.2 Modules

In this work, we aim to achieve adaptation to the target domain and, at the same time, balance the trade-off between the different loss functions (tasks) more effectively, avoiding the restrictive problems caused by simply adding them. We propose a Target Oriented Dynamic Adaption (TODA) framework to solve these issues. Specifically, inspired by the study of James et al. [36], which shows that identifying a small number of task-specific parameters allows a model to balance flexibility and robustness, we propose a Domain-specific Adapter (DSA) to achieve adaptation to the target domain. Meanwhile, a decoupling module is introduced to refine the features further and minimize the impact of domain shift on the CD-FSL classification task. In addition, we employ multiple loss functions to constrain the model during optimization. However, different loss functions have different priorities for model performance, and simply adding them often restricts the convergence direction of the model, affecting its final performance. Therefore, we propose an adaptive loss optimization module based on the homoscedastic uncertainty in Bayesian modeling.

Our TODA model is exhibited in Fig. 2. The TODA framework comprises feature extraction network F, decoupling module D, and adaptive loss optimization. The feature extraction network F is composed of ResNet10 [37] and the domain-specific adaptor (DSA) module proposed in this study; we denote them by \(f_{\phi }\) and \(\gamma _{\beta }\) respectively, where \(\phi \) and \(\beta \) are their network parameters. The decoupling module \(D_{\theta } \) is an encoder that can generate two latent feature representations to achieve the purpose of decoupling. We use \(\theta \) to represent its network parameters. Adaptive loss optimization is composed of GNN classifier \(g_{\varphi _{fsl}}\), domain classifier \(g_{\varphi _{dom}}\) and an adaptive weight module \(A_{\omega }\), where \(\varphi _{fsl}\), \(\varphi _{dom}\) and \(\omega \) represent their parameters respectively. The adaptive weight module can allocate different weights according to the significance of different tasks to the model in the training process.

Fig. 2

The overview of the proposed TODA for CD-FSL. \(\varvec{D}1\) and \(\varvec{D}2\) represent domain-invariant and domain-specific features, respectively

This model adopts a two-stage training strategy. For the pre-training phase, the aim is to get a good initialization of the feature extraction network. Therefore, in this stage, we train the backbone network by utilizing the source base data \(Ds_{b}\). For the meta-training phase, the aim is to quickly adapt and learn the information of the new images in a meta-learning manner. We will train the entire network in this stage, including feature extraction network F, decoupling module D, and adaptive loss optimization under the meta-training paradigm. To take full advantage of auxiliary data, this paper introduces a simple but effective data augmentation technique - mixup, which works together with the meta-learning mechanism. Concretely, the source and auxiliary episodes will be randomly sampled during each iteration from \(Ds_{b}\) and \(D_{aux}\). During the mixup process, the support sets in the source and auxiliary episodes do not participate in the mixup operation because reliable support sets are required as judgment criteria in the FSL classification task. Therefore, only the images contained in the source query set \(Q_{sq}\) and the auxiliary query set \(Q_{aq}\) are linearly mixed at a certain ratio. The detailed operation is shown in the following formulation.

$$\begin{aligned} Q_{mix}=\lambda Q_{sq}+(1-\lambda ) Q_{aq}, \end{aligned}$$
(1)

where \(Q_{mix}\) represents the mixed query set, \(\lambda \) represents the mixing proportion, and \(\lambda \sim {\text {Beta}}(\alpha ,\alpha )\). The following is a detailed exposition of the three modules constituting the TODA network framework.
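A minimal sketch of the query-set mixup in Eq. (1), assuming the two query sets are batched image tensors of identical shape; the function name and tensor layout are illustrative assumptions.

```python
# Illustrative sketch of Eq. (1): mixing source and auxiliary query images.
import torch
from torch.distributions import Beta

def mix_query_sets(q_sq: torch.Tensor, q_aq: torch.Tensor, alpha: float = 1.0):
    """q_sq, q_aq: query images of shape (num_query, C, H, W) from the source
    and auxiliary episodes. Returns the mixed query set Q_mix and the ratio lambda."""
    lam = Beta(alpha, alpha).sample().item()   # lambda ~ Beta(alpha, alpha)
    q_mix = lam * q_sq + (1.0 - lam) * q_aq    # Eq. (1)
    return q_mix, lam
```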

3.2.1 Feature Extraction Module

A good feature extraction network is expected to extract discriminative feature representations so that the classifier can classify them better. In standard few-shot learning, the parameters of the backbone network are often fine-tuned on the support set to achieve the adaptation process to new tasks. However, in the CD-FSL setting, it becomes more challenging on account of the large domain shift between the source dataset and the target dataset, which requires the model to adapt not only to the new category information but also to the new domain information. The backbone network will suffer from improper optimization if it is still fine-tuned like standard few-shot learning. For example, a network trained on Mini-Imagenet [15] will be interested in objects closer to Mini-Imagenet, such as tree branches/mountains/background, etc., but will ignore the regions that are useful for identifying target classes. Therefore, we propose a lightweight Domain-specific Adapter (DSA) in this paper. Specifically, the problem of feature confusion is alleviated by adding the DSA module to the backbone network.

Fig. 3

Illustration of our domain adaptation for cross-domain few-shot learning. “3\(\times \)3” denotes a 3\(\times \)3 convolution and “BN” means Batch Normalization layer

In this paper, we use ResNet10, which is widely used in FSL, as the backbone network of the TODA framework. We add a DSA to each residual block of ResNet10 to form a new feature extraction network, whose structure is displayed in Fig. 3. During model training, the DSA can be guided by the auxiliary dataset \(D_{aux}\) to make the extracted features more specific to the tasks in the target domain and to reduce the influence of previous tasks in the source domain. Concretely, the DSA parameterized by \(\beta \) is incorporated into the output of the i-th layer. The above operation is formulated as follows:

$$\begin{aligned} f_{\left\{ \phi _{i}, \beta \right\} }(\varvec{x})=\gamma _{\beta }\left( f_{\phi _{i}}(\varvec{x}), \varvec{x}\right) , \end{aligned}$$
(2)

where \(\varvec{x} \in \mathbb {R}^{B\times C\times H\times W}\) is the input of the network, and \(f_{\phi _{i}}\) is the i-th layer in \(f_{\phi }\) with network parameters \(\phi _{i}\). In our code, the DSA is implemented as a 1\(\times \)1 convolution. Importantly, the number of parameters \(\beta \) of the domain-specific adaptor is significantly smaller than the number of backbone parameters \(\phi \).

In this work, we design two approaches to fuse DSA into the feature extraction network: series form and residual form. The output of connecting DSA to the \(f_{\phi _{i}}\) layer through series connection and residual connection is as follows:

$$\begin{aligned} f_{\left\{ \phi _{i}, \beta \right\} }(\varvec{x})=\gamma _{\beta } \circ f_{\phi _{i}}(\varvec{x}), \end{aligned}$$
(3)
$$\begin{aligned} f_{\left\{ \phi _{i}, \beta \right\} }(\varvec{x})=\gamma _{\beta }(\varvec{x})+f_{\phi _{i}}(\varvec{x}). \end{aligned}$$
(4)

Experiments show that the residual form is better than the series form for improving the model performance. We will discuss this in Sect. 4.3.
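For concreteness, the sketch below shows one way a residual-form adaptor (Eq. (4)) could be attached to a backbone block using a 1\(\times \)1 convolution; the class names, zero initialization, and the assumption that the block preserves channel count and spatial size are ours, not the authors'.

```python
# Sketch of a residual-form domain-specific adaptor (Eq. (4)); structural details are assumed.
import torch.nn as nn

class DomainSpecificAdaptor(nn.Module):
    """1x1 convolution adaptor gamma_beta applied to the block input x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.conv.weight)  # start as a no-op residual (assumption)

    def forward(self, x):
        return self.conv(x)

class AdaptedBlock(nn.Module):
    """Computes gamma_beta(x) + f_phi_i(x), assuming the wrapped block keeps
    the same channel count and spatial resolution as its input."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        self.adaptor = DomainSpecificAdaptor(channels)

    def forward(self, x):
        return self.adaptor(x) + self.block(x)
```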

3.2.2 Decoupling Module

To reduce domain shift, this study exploits a variant of the VAE encoder to decouple the feature representations extracted by the feature extraction network into domain-invariant features \(\varvec{D}1\) and domain-specific features \(\varvec{D}2\). We argue that the features extracted by the feature extractor contain both the category information and the domain information of an image, and that the domain information may harm the few-shot classifier and degrade classification accuracy.

The decoupling module in this work differs from the VAE encoder in that it generates two latent feature representations to achieve decoupling. To be specific, a VAE encoder typically uses a series of linear layers to process the input data and generate the hidden representation; these linear layers learn abstract features of the data and convert them into a mean and a variance in the latent space. In this work, we generate two hidden representations (each a mean and a variance) by adding two additional linear layers (i.e., FC1a and FC1b) on top of the VAE encoder. At the same time, under the supervision of the domain classifier, the decoupling module can better extract the decoupled information (domain-invariant and domain-specific features), thus achieving the purpose of decoupling.

The decoupling module is displayed in Fig. 4. It contains five fully connected layers, a batch normalization layer, and a ReLU activation layer. The role of "FC12" is to extract the generic representation, while "FC1a" and "FC1b" compute the mean \(\mu 1\) and variance \(\sigma 1\) for synthesizing domain-invariant features. Correspondingly, "FC2a" and "FC2b" compute the mean \(\mu 2\) and variance \(\sigma 2\) for synthesizing domain-specific features. Specifically, the features produced by the source support set and the auxiliary support set through the feature extraction network are further sent to the decoupling module to generate domain-invariant features \(\varvec{D}1_{sou}\), \(\varvec{D}1_{aux}\) and domain-specific features \(\varvec{D}2_{sou}\), \(\varvec{D}2_{aux}\). Similarly, for the mixed query set, the domain-invariant feature \(\varvec{D}1_{Qmix}\) and domain-specific feature \(\varvec{D}2_{Qmix}\) are also generated.

Fig. 4

The structure of the decoupling module
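The following is a minimal sketch of such a decoupling encoder as described above (a shared FC layer followed by two mean/variance heads and reparameterization); the hidden sizes and exact layer ordering are placeholders, not the authors' configuration.

```python
# Sketch of the decoupling module: a shared FC layer ("FC12") followed by two
# (mean, variance) heads that synthesize domain-invariant (D1) and domain-specific (D2)
# features via reparameterization. Hidden sizes are placeholders.
import torch
import torch.nn as nn

class DecouplingModule(nn.Module):
    def __init__(self, in_dim=512, hidden_dim=256, latent_dim=128):
        super().__init__()
        self.fc12 = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
        )
        self.fc1a = nn.Linear(hidden_dim, latent_dim)  # mu1 (domain-invariant)
        self.fc1b = nn.Linear(hidden_dim, latent_dim)  # log-variance 1
        self.fc2a = nn.Linear(hidden_dim, latent_dim)  # mu2 (domain-specific)
        self.fc2b = nn.Linear(hidden_dim, latent_dim)  # log-variance 2

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, feat):
        h = self.fc12(feat)
        d1 = self.reparameterize(self.fc1a(h), self.fc1b(h))  # domain-invariant features D1
        d2 = self.reparameterize(self.fc2a(h), self.fc2b(h))  # domain-specific features D2
        return d1, d2
```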

3.2.3 Adaptive Loss Optimization

Ultimately, TODA is jointly optimized by multi-task learning consisting of a few-shot classification task, a domain discrimination task, and a domain confusion task. The decoupled domain-invariant features contain the semantic information of the images but not the domain information, so they are used for the few-shot classification task. The domain-specific features mainly carry domain information and are used for the domain classification task.

For the few-shot classification task, the domain-invariant features \(\varvec{D}1_{sou}\), \(\varvec{D}1_{aux}\), and \(\varvec{D}1_{Qmix}\) are fed into the GNN classifier, which predicts the labels of the mixed query images. Since the mixed query images are obtained by linearly mixing the source query images and the auxiliary query images at proportion \(\lambda \), the confidence of assigning a mixed query image to its corresponding category in the source support set is \(\lambda \), so this part of the FSL classification loss is weighted as \(\lambda \mathcal {L}_{SFSL}\). Likewise, the loss of matching the images in the mixed query set to their corresponding categories in the auxiliary support set is weighted as \((1-\lambda ) \mathcal {L}_{AFSL}\). The final FSL loss is formulated as follows:

$$\begin{aligned} \mathcal {L}_{FSL}=\lambda \mathcal {L}_{SFSL}+(1-\lambda ) \mathcal {L}_{AFSL}. \end{aligned}$$
(5)
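A minimal sketch of Eq. (5), assuming the two per-episode classification losses have already been computed by the GNN classifier; the function and argument names are illustrative.

```python
# Illustrative sketch of Eq. (5): weighting the two FSL classification losses by lambda.
def mixed_fsl_loss(loss_sfsl, loss_afsl, lam):
    """loss_sfsl / loss_afsl: losses of the mixed query set against the source and
    auxiliary support sets; lam is the mixup ratio from Eq. (1)."""
    return lam * loss_sfsl + (1.0 - lam) * loss_afsl
```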

For the domain confusion task, we expect the domain classifier to be confused when it sees the domain-invariant features \(\varvec{D}1\), so that it cannot distinguish which domain they belong to; this forms the domain confusion loss \(\mathcal {L}_{\text{ con }}\). The domain classifier is a fully connected classifier that maps the input features to the labels of the two domains, where labels 0 and 1 represent the source and target domains, respectively. The domain confusion loss function is shown in (6) below.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text{ con }}&=\frac{1}{3} \sum \left[ {\text {KL}}\left( g_{\varphi _{dom}}\left( \varvec{D}1_{sou }\right) , \varvec{y}1_{sou}\right) \right. \\&\quad +{\text {KL}}\left( g_{\varphi _{dom}}\left( \varvec{D }1_{aux}\right) , \varvec{y }1_{aux}\right) \\&\quad \left. +{\text {KL}}\left( g_{\varphi _{dom}}\left( \varvec{D }1_{Qmix}\right) , \varvec{y }1_{Qmix}\right) \right] ,\\ \end{aligned} \end{aligned}$$
(6)

where \({\text {KL}}\left( \right) \) represents the Kullback-Leibler divergence loss, and \(\varvec{y}1_{sou}\), \(\varvec{y}1_{aux}\) and \(\varvec{y}1_{Qmix}\) represent the ground truth of \(\varvec{D}1_{sou}\), \(\varvec{D}1_{aux}\) and \(\varvec{D}1_{Qmix}\), respectively.

For the domain discrimination task, the domain classifier is expected to easily recognize domain labels when domain-specific features \(\varvec{D}2\) are fed into the domain classifier, thus constituting the domain discrimination loss \(\mathcal {L}_{\text{ dis }}\).

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text{ dis }}&= \frac{1}{3} \sum \left[ \textrm{CE}\left( g_{\varphi _{dom}}\left( \varvec{D }2_{sou}\right) , \varvec{y}2_{sou}\right) \right. \\&\quad + \textrm{CE}\left( g_{\varphi _{dom}}\left( \varvec{D}2_{aux }\right) , \varvec{y}2_{aux}\right) \\&\quad +\lambda \cdot \textrm{CE}\left( g_{\varphi _{dom}}\left( \varvec{D}2_{Qmix }\right) , \varvec{y}2_{mix1}\right) \\&\quad \left. +(1-\lambda ) \cdot \textrm{CE}\left( g_{\varphi _{dom}}\left( \varvec{D}2_{Qm i x}\right) , \varvec{y}2_{mix2}\right) \right] , \end{aligned} \end{aligned}$$
(7)

where \(\textrm{CE}\left( \right) \) represents the cross-entropy loss; \(\varvec{y}2_{sou}\) and \(\varvec{y}2_{aux}\) are the domain labels of \(\varvec{D}2_{sou}\) and \(\varvec{D}2_{aux}\), while \(\varvec{y}2_{mix1}\) and \(\varvec{y}2_{mix2}\) are the two domain labels assigned to \(\varvec{D}2_{Qmix}\), weighted by \(\lambda \) and \(1-\lambda \), respectively.
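To make Eqs. (6) and (7) concrete, the sketch below implements both domain losses, under the assumption (ours, not stated explicitly in the text) that the confusion targets are uniform distributions over the two domain labels and that domain label 0 denotes the source and 1 the target domain.

```python
# Illustrative sketch of the domain losses in Eqs. (6) and (7).
# Assumption: confusion targets are uniform over the two domains; label 0 = source, 1 = target.
import torch
import torch.nn.functional as F

def domain_confusion_loss(domain_clf, d1_list):
    """d1_list: [D1_sou, D1_aux, D1_Qmix]. KL divergence between the predicted domain
    distribution and a uniform target, averaged over the three feature sets (Eq. (6))."""
    losses = []
    for d1 in d1_list:
        log_p = F.log_softmax(domain_clf(d1), dim=1)  # predicted domain distribution
        target = torch.full_like(log_p, 0.5)          # uniform over {source, target}
        losses.append(F.kl_div(log_p, target, reduction='batchmean'))
    return sum(losses) / len(losses)

def domain_discrimination_loss(domain_clf, d2_sou, d2_aux, d2_mix, lam):
    """Cross-entropy against domain labels; the mixed query features are supervised
    with both domain labels, weighted by lambda and 1 - lambda (Eq. (7))."""
    src = lambda x: torch.zeros(x.size(0), dtype=torch.long, device=x.device)
    tgt = lambda x: torch.ones(x.size(0), dtype=torch.long, device=x.device)
    loss = (F.cross_entropy(domain_clf(d2_sou), src(d2_sou))
            + F.cross_entropy(domain_clf(d2_aux), tgt(d2_aux))
            + lam * F.cross_entropy(domain_clf(d2_mix), src(d2_mix))
            + (1.0 - lam) * F.cross_entropy(domain_clf(d2_mix), tgt(d2_mix)))
    return loss / 3.0
```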

These three tasks are performed simultaneously to optimize the model, but different tasks may have different importance to the model, and simply adding these three losses may not be conducive to model optimization. Therefore, we design a module that allots different weights to different tasks according to the importance of the task to achieve the purpose of adaptive optimization. In Bayesian modeling, the homoscedastic uncertainty is independent of the input and depends on the intrinsic uncertainty of the task. By converting the homoscedastic uncertainty into the weight of the loss, the model can possess the ability to dynamically adjust the loss. We define the homoscedastic uncertainty as \(\eta \) in this work. After we obtain the output prediction value of the neural network and the corresponding ground truth, we can calculate the parameter \(\eta \) by the maximum likelihood estimation.

Define the probabilistic model as follows. For classification tasks, \(f^{\textrm{W}}\left( \varvec{x}\right) \) denotes the output prediction value of the neural network, and the network output is re-scaled by Softmax.

$$\begin{aligned} p\left( \varvec{y} \mid f^{\textrm{W}}(\varvec{x}), \eta \right) ={\text {Softmax}}\left( \frac{1}{\eta ^{2}} f^{\textrm{W}}(\varvec{x})\right) . \end{aligned}$$
(8)

This representation can be seen as a Boltzmann distribution, where \(\eta ^2\) can be seen as kT, which is the product of the Boltzmann constant k and the thermodynamic temperature T. Therefore, the maximum likelihood estimate of the classification problem is as follows:

$$\begin{aligned} \begin{aligned} \log p\left( \varvec{y}=c \mid f^{\textrm{W}}(\varvec{x}), \eta \right) =\frac{1}{\eta ^{2}} f_{c}^{\textrm{W}}(\varvec{x}) -\log \sum _{c^{\prime }} \exp \left( \frac{1}{\eta ^{2}} f_{c^{\prime }}^{\textrm{W}}(\varvec{x})\right) , \end{aligned} \end{aligned}$$
(9)

where \(f_c^{\textrm{W}}\left( \varvec{x}\right) \) is the output of the c-th class in \(f^{\textrm{W}}\left( \varvec{x}\right) \). Then, let the loss of the classification problem be:

$$\begin{aligned} \textrm{L}_i\left( \textrm{W} \right) = - \log {\text {Softmax}}\left( y_i,f^{\textrm{W}}(\varvec{x}) \right) , \end{aligned}$$
(10)

where \(f^{\textrm{W}}(\varvec{x})\) is not scaled. We further simplify (9) to obtain:

$$\begin{aligned} \begin{aligned} \mathcal {L}(\textrm{W}, \eta )&=-\log p\left( y=\textrm{c} \mid f^{\textrm{W}}(\varvec{x}), \eta \right) \\&=-\log p\left( y=\textrm{c} \mid f^{\textrm{W}}(\varvec{x}), \eta \right) +\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})-\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W}) \\&=\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})-\log p\left( y=\textrm{c} \mid f^{\textrm{W}}(\varvec{x}), \eta \right) -\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W}) \\&=\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})-\frac{1}{\eta ^{2}} f_{\textrm{c}}^{\textrm{W}}(\varvec{x})+\log \sum _{\textrm{c}^{\prime }} \exp \left( \frac{1}{\eta ^{2}} f_{\textrm{c}^{\prime }}^{\textrm{W}}(\varvec{x})\right) \\&\quad +\frac{1}{\eta ^{2}} \log {\text {Softmax}}\left( y=\textrm{c}, f^{\textrm{W}}(\varvec{x})\right) \\&=\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})-\frac{1}{\eta ^{2}} f_{\textrm{c}}^{\textrm{W}}(\varvec{x})+\log \sum _{\textrm{c}^{\prime }} \exp \left( \frac{1}{\eta ^{2}} f_{\textrm{c}^{\prime }}^{\textrm{W}}(\varvec{x})\right) \\&\quad +\frac{1}{\eta ^{2}}\left[ f_{\textrm{c}}^{\textrm{W}}(\varvec{x})-\log \sum _{\textrm{c}^{\prime }} \exp \left( f_{\textrm{c}^{\prime }}^{\textrm{W}}(\varvec{x})\right) \right] \\&=\frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})+\log \frac{\sum _{\textrm{c}^{\prime }} \exp \left( \frac{1}{\eta ^{2}} f_{\textrm{c}^{\prime }}^{\textrm{W}}(\varvec{x})\right) }{\left[ \sum _{\textrm{c}^{\prime }} \exp \left( f_{\textrm{c}^{\prime }}^{\textrm{W}}(\varvec{x})\right) \right] ^{\frac{1}{\eta ^{2}}}}\\&\approx \frac{1}{\eta ^{2}} \textrm{L}_{c}(\textrm{W})+\log \eta . \end{aligned} \end{aligned}$$
(11)

When the predicted value \(f^{\textrm{W}}\left( \varvec{x}\right) \) and the label \(\varvec{y}\) are known, we can utilize the gradient descent algorithm to acquire the final \(\eta \). In this work, our TODA has three optimization tasks, which will produce three outputs \(\varvec{y}_{1}, \varvec{y}_{2}, \varvec{y}_{3}\). Therefore, the model is modelled with multiple softmax likelihoods separately. The joint loss function \(\mathcal {L}\left( \mathrm {~W}, \eta _{1}, \eta _{2}, \eta _{3}\right) \) is:

$$\begin{aligned} \begin{aligned} \mathcal {L}&\left( \textrm{W}, \eta _{1}, \eta _{2}, \eta _{3}\right) =-\log p\left( y_1, y_2,y_3\mid f^{\textrm{W}}(\varvec{x})\right) \\&=-\log \left( p\left( y_1\mid f^{\textrm{W}}(\varvec{x}),\eta _{1}\right) \cdot p\left( y_2\mid f^{\textrm{W}}(\varvec{x}),\eta _{2}\right) \cdot p\left( y_3\mid f^{\textrm{W}}(\varvec{x}),\eta _{3}\right) \right) \\&=-\log p\left( y_1\mid f^{\textrm{W}}(\varvec{x}),\eta _{1}\right) -\log p\left( y_2\mid f^{\textrm{W}}(\varvec{x}),\eta _{2}\right) -\log p\left( y_3\mid f^{\textrm{W}}(\varvec{x}),\eta _{3}\right) \\&\approx \frac{1}{\eta _{1}^{2}} \textrm{L}_{1}(\textrm{W})+\frac{1}{\eta _{2}^{2}}\textrm{L}_{2}(\textrm{W})+\frac{1}{\eta _{3}^{2}}\textrm{L}_{3}(\textrm{W}) +\log \eta _{1}+\log \eta _{2}+\log \eta _{3} , \end{aligned} \end{aligned}$$
(12)

where \(\eta _{1}\), \(\eta _{2}\) and \(\eta _{3}\) are the adaptive weights of the three tasks. The detailed training process of TODA is demonstrated in Algorithm 1. In this paper, \(\textrm{L}_{1}=\mathcal {L}_{F S L}\), \(\textrm{L}_{2}=\mathcal {L}_{\text {dis}}\) and \(\textrm{L}_{3}=\mathcal {L}_{\text {con}}\). Our final total loss function is presented as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {total}} =\frac{1}{\eta _{1}^{2}} \mathcal {L}_{FSL}+\frac{1}{\eta _{2}^{2}}\mathcal {L}_{\text {dis}}+\frac{1}{\eta _{3}^{2}}\mathcal {L}_{\text {con}} +\log \eta _{1}+\log \eta _{2}+\log \eta _{3} . \end{aligned} \end{aligned}$$
(13)
Algorithm 1

The training procedure for TODA
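As a sketch of how the adaptive weighting in Eq. (13) could be realized, the log of each \(\eta \) can be kept as a learnable parameter and optimized jointly with the network; this parameterization (learning \(\log \eta \) for numerical stability) is an assumption rather than the authors' exact implementation.

```python
# Sketch of the adaptive loss weighting of Eq. (13); the log-eta parameterization is assumed.
import torch
import torch.nn as nn

class AdaptiveLossWeighting(nn.Module):
    """Homoscedastic-uncertainty weighting of the three task losses (L_FSL, L_dis, L_con)."""
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_eta = nn.Parameter(torch.zeros(num_tasks))  # log eta_i, learned jointly

    def forward(self, losses):
        """losses: sequence of scalar task losses [L_FSL, L_dis, L_con]."""
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-2.0 * self.log_eta[i])       # 1 / eta_i^2
            total = total + precision * loss + self.log_eta[i]  # Eq. (13)
        return total

# Usage sketch: the weighting parameters are optimized together with the network.
# aw = AdaptiveLossWeighting()
# optimizer = torch.optim.Adam(list(model.parameters()) + list(aw.parameters()), lr=1e-3)
# total_loss = aw([loss_fsl, loss_dis, loss_con])
```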

4 Experiments

4.1 Experimental Setups

4.1.1 Datasets

  • Mini-ImageNet [15] is a broadly used image dataset for FSL. In 2016, the Google DeepMind team created Mini-ImageNet as a small subset of the ImageNet dataset for FSL research. Since then, Mini-ImageNet has become a benchmark dataset for meta-learning and FSL. Mini-ImageNet has 100 categories, each with 600 images, for a total of 60,000 images.

  • CUB [38] (Caltech-UCSD Birds-200-2011) is a classic bird image dataset created in collaboration between the California Institute of Technology (Caltech) and the University of California, San Diego (UCSD). The dataset aims to provide a rich, diverse, and challenging sample of bird images for bird recognition and classification research. The CUB dataset contains 200 species of birds, totaling 11,788 images.

  • Cars [39] is a car image dataset widely used in computer vision (CV). Created by Stanford University, it provides researchers with a large sample of images containing diverse car makes, models, and colors. The Cars dataset contains over 16,000 images of cars from 196 distinct brands and models taken in different environments, including city streets, highways, etc., with various poses and angles to capture the fine features and details of the cars.

  • Places [40] dataset is created by MIT to provide powerful training and evaluation resources for tasks such as scene understanding, image classification, and image generation. Places is a large-scale dataset containing millions of real scene images, collecting 365 scene types worldwide. This dataset covers a wide variety of locations and scenes, including city streets, natural landscapes, indoor spaces, and more. The Places dataset is ideal for studying problems such as scene classification, scene recognition, and image generation.

  • Plantae [41] is a dataset created in collaboration between botanists and computer vision experts to provide a high-quality, diverse sample of plant images in support of plant species identification and classification. The Plantae dataset collects images of 200 different plant species, covering a rich variety of plants, including flowers, trees, and herbs. This dataset provides an important resource for research at the intersection of botany and CV.

In this experiment, we evaluate our method on five datasets: Mini-Imagenet, CUB, Cars, Places, and Plantae. Mini-Imagenet is used as the source dataset \(D_{sou}\), and the other four datasets are taken as the target dataset \(D_{tar}\). We further obtain the source base dataset \(Ds_{b}\), auxiliary dataset \(D_{aux}\), and target novel data \(Dt_{n}\) according to the division in Meta-FDMixup [13].

4.1.2 Implementation Details

In the experiment, the value of \(\alpha \) in the beta distribution of the mixing process is 1. In both training phases, the network is trained for 400 epochs. For network optimization, we employ Adam as the optimizer. The learning rate of Adam in the pre-training phase is 0.001. In the meta-training stage, the learning rate of the feature extraction module is set to 0.001, and that of the decoupling module and the classifiers is set to 0.002. The feature extraction network in the pre-training stage is ResNet10 without our domain-specific adaptors; the DSA is combined with ResNet10 in the meta-training stage to form the new feature extraction network. In the first 200 epochs of the meta-training phase, the parameters of ResNet10 in the feature extraction module are frozen and only the DSA parameters are optimized; in the last 200 epochs, the ResNet10 parameters are unfrozen and updated together with the DSA. GNN is employed as the few-shot classifier, and a fully connected classifier is used as the domain classifier. We evaluate the performance of TODA in the 5-way 1-shot and 5-way 5-shot settings. The average accuracy over 1000 episodes is taken as the final experimental result.
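A minimal sketch of the staged freezing schedule described above (ResNet10 weights frozen for the first 200 meta-training epochs, then unfrozen); the attribute names `backbone` and `adaptors` are placeholders for how the feature extractor might expose its sub-modules.

```python
# Sketch of the staged freezing schedule during meta-training (module names are placeholders).
def set_backbone_trainable(feature_extractor, trainable: bool):
    """feature_extractor is assumed to expose .backbone (ResNet10) and .adaptors (the DSAs)."""
    for p in feature_extractor.backbone.parameters():
        p.requires_grad = trainable
    for p in feature_extractor.adaptors.parameters():
        p.requires_grad = True  # DSA parameters are optimized throughout meta-training

# for epoch in range(400):
#     set_backbone_trainable(model.feature_extractor, trainable=(epoch >= 200))
#     ...  # sample source/auxiliary episodes, compute L_total (Eq. (13)), update with Adam
```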

4.2 Main Results and Comparisons

To prove the effectiveness of the TODA method, we conduct extensive experiments on four target datasets: CUB, Cars, Places, and Plantae. At the same time, we compare TODA with multiple FSL and CD-FSL methods, where the FSL methods include MatchingNet [4], RelationNet [3], and GNN [5], and the CD-FSL methods include ATA [10], Wave-SAN [32], Meta-FDMixup [13], and Generalized Meta-FDMixup [42]; Meta-FDMixup is our baseline method. Except for Meta-FDMixup and Generalized Meta-FDMixup, none of the other methods uses auxiliary data. To increase the diversity of the comparison, we modify the LPR [43] method by appending a small amount of target data to its initial training set for retraining and label the improved method m-LPR. The experimental results of TODA and the methods mentioned above are presented in Tables 1 and 2. Classification accuracy is evaluated over 1000 episodes with \(95\%\) confidence intervals.

From Tables 1 and 2, we can observe that the TODA method proposed in this study outperforms the other competing methods in most cases. Concretely, in the 5-way 5-shot case, TODA achieves 81.47%, 67.86%, 79.68%, and 70.06% classification accuracy on CUB, Cars, Places, and Plantae, respectively. Compared with the baseline, our classification accuracy is improved by 2.01%, 1.34%, 0.76%, and 0.84% on CUB, Cars, Places, and Plantae, respectively. Meanwhile, the performance of our model surpasses the state-of-the-art method [42], which confirms the superiority of TODA. In the 5-way 1-shot case, our method achieves 64.38%, 49.22%, 59.58%, and 51.31% classification accuracy on CUB, Cars, Places, and Plantae. In addition, we observe the following from the experimental results: (1) FSL methods perform unsatisfactorily in the CD-FSL setup. (2) Methods that use an auxiliary target dataset during training generally perform better than those without one.

Table 1 The average classification accuracy (%) with 95% confidence intervals for 1000 episodes on the four target datasets under the 5-way 5-shot setting
Table 2 The average classification accuracy (%) with 95% confidence intervals for 1000 episodes on the four target datasets under the 5-way 1-shot setting

4.3 Ablation Study

We first discuss the advantages of the DSA and adaptive weight module in TODA, and then investigate the effect of the DSA connection mode on the final classification results of the TODA model. Ablation experiments and analysis are shown below.

(1) Analysis of DSA and the adaptive weight module: To prove the effectiveness of DSA and adaptive loss optimization, we run experiments in the 5-way 5-shot setting. Specifically, we add the DSA and the adaptive weight module to the baseline method separately to verify the effectiveness of each. The specific experimental results are displayed in Table 3.

From Table 3, we can see that the performance of the network improves significantly after adding the DSA and the adaptive weight module, respectively, which proves the effectiveness of these two components. The classification results show that, after adding DSA to the backbone network, more discriminative features can be extracted for the target domain and the new class data, thereby mitigating the feature confusion that occurs when the network transfers from the source domain to the target domain. In addition, we also find that introducing the adaptive weight module not only improves the performance of the network but also speeds up its convergence: the network converges at the 250th epoch, indicating that the adaptive weight module can adaptively allocate different weights to different tasks according to their importance to the model, thereby accelerating convergence.

Table 3 Ablation study to certify the effectiveness of the DSA and adaptive weight module, where "DSA" indicates the domain-specific adaptor and "AW" indicates the adaptive weight module
Table 4 Comparison of the two connection methods, where "s" represents the series connection and "r" represents the residual connection

(2) Analysis of different connection modes: In this work, we design two strategies for adding the DSA to the residual blocks of ResNet10. To determine which of the two connection strategies is more conducive to improving model performance, we conduct 5-way 5-shot experiments on the baseline method.

Table 4 displays the comparison results of the two connection strategies. We can observe from the table that the residual connection outperforms the series connection on all four target datasets, proving that the residual connection is more conducive to improving model performance. Therefore, in this paper, the DSA and the backbone network are fused in the form of residual connections. This result also suggests that the residual structure increases the network’s generalization ability, making it preferable for transferring knowledge to new domains and classes.

5 Conclusion

In this paper, we propose the TODA model, which effectively alleviates the maladaptation that occurs when a network trained on the source domain is transferred to the target domain and further improves the generalization ability of the model. Specifically, we mitigate the feature confusion that arises when the model is transferred to the target domain by introducing a domain-specific adaptor to fit the data distribution of the target domain. Besides, we adopt a multi-task optimization scheme that dynamically adjusts the weights of different tasks during model optimization. Extensive experiments on four publicly available benchmark datasets confirm that the proposed TODA method achieves excellent performance on the CD-FSL task.