1 Introduction

Deep learning models have achieved breakthroughs in many visual understanding tasks, such as image classification [17] and object detection [20]. However, their performance degrades significantly when only few labelled examples are available, and these large-capacity models usually have difficulty transferring the learnt knowledge to unseen classes. As such, there has been increasing interest in few-shot learning (FSL) [14, 38, 40], which aims to develop well-generalized models that recognize new categories using only limited annotated samples.

Recently, meta-learning [8, 33, 42], a.k.a. learning-to-learn, has made significant advances in FSL. The primary idea of meta-learning is to exploit episodic training to transfer the knowledge learnt from a massive known meta-training set to novel classes. For example, Sun et al. [39] proposed Meta-Transfer Learning, which introduces scaling and translation parameters to adjust the weight parameters to the needs of new tasks. However, the success of such transfer depends on the similarity between tasks; it is difficult to transfer knowledge effectively when new tasks differ significantly from old ones. Another practical solution to FSL is metric-based methods [23, 31, 45], which aim to learn good representations through deep networks and generate final predictions based on the similarity or distance between support and query samples, such as RelationNet [40] and ProtoNet [38]. Though metric learning has become a prominent approach to FSL, it is still challenging to select an appropriate distance metric for various tasks and datasets, as different metrics can yield different results.

In this paper, we propose a new dual selective knowledge transfer (DSKT) framework for FSL tasks, following the transfer learning strategy. The framework and overall training process are illustrated in Fig. 1. In the first stage of the DSKT framework (DSKT-1), we introduce a multi-task loss to mine useful signals from the limited labelled examples. This simultaneously reduces the risk of overfitting and ensures heterogeneity in the prediction space.

Fig. 1

Overall training process of DSKT, where \({{f}_{\phi }}\) is a feature extractor, \({{g}_{\theta }}\) and \({{g}_{\varphi }}\) are two different classifiers. DSKT-1 denotes the first process which includes a binary cross entropy (BCE) loss and a cross entropy (CE) loss, with the goal to learn a good feature extractor. DSKT-2 represents the stage of dual selective knowledge distillation (DSKD), which selectively transfers knowledge to ensure the student model’s generalization ability

Furthermore, to implement selective knowledge transfer, we propose an effective dual selective knowledge distillation method in the second stage (DSKT-2). Specifically, we first employ the model learnt in DSKT-1 as a teacher network and train a student model based on the teacher's outputs. Then, different from the classical knowledge distillation (KD) [18] loss that treats all classes equally, we reformulate the KD loss into two independent components according to the class labels of samples, inspired by [50]. By separately fine-tuning the KD strength of each component, the model is endowed with the ability to selectively transfer knowledge. Meanwhile, we introduce PolyLoss [22] to adjust the strength of learning on the current samples, which complements the knowledge transferred from the teacher model. After training, we freeze the student model up to the penultimate layer and apply it as a feature extractor. During evaluation, we fit a logistic regression classifier on top of the frozen feature extractor without fine-tuning, following the implementation of [41].

Compared to previous work [32], we take a similar approach in the first stage to train a discriminative feature extractor. However, we design a distinct DSKD framework for knowledge distillation. The differences and advantages can be summarized as follows: 1) We only use the original samples during the knowledge distillation process, which reduces computation to some extent. 2) We propose a new DSKD framework for knowledge distillation, which integrates decoupled knowledge distillation (DKD) and PolyLoss, enabling the student model to effectively learn knowledge from both the teacher model and the current samples and enhancing its capacity to generalize to new tasks.

The main contributions of this work are as follows: 1) We incorporate KD and PolyLoss into a knowledge transfer framework for FSL tasks. 2) We exploit DKD to improve conventional KD, which enables the student model to selectively learn knowledge from the teacher. 3) Experiments on four popular benchmark datasets show that our approach achieves new state-of-the-art performance in all 5-way 1-shot and 5-way 5-shot tasks.

2 Related work

Few-shot learning (FSL)

As a thriving research area in deep learning, FSL focuses on learning patterns from one set of data (the base dataset) and then adapting to a disjoint set (the novel dataset) with few labelled samples. To evaluate the model's performance, the novel dataset is split into a support set and a query set. FSL is also called N-way K-shot classification when the support set contains N classes and K samples per class.

Two main training streams have been explored in few-shot learning, namely meta-learning and metric learning. On one hand, meta-learning consists of two phases, meta-training and meta-testing, where each phase involves a family of tasks. Meta-learning aims to perform well on new tasks based on the knowledge learnt from previous tasks. For example, Finn et al. [8] proposed a model-agnostic meta-learning (MAML) approach, which can adapt to new tasks quickly from a better weight initialization. With the goal of simplifying MAML, Reptile [30] removes the re-initialization for each task. MetaOptNet [21] replaces the original linear classifier with an SVM and solves the resulting convex optimization problem using quadratic programming (QP). Sun et al. [39] proposed a meta-transfer learning (MTL) method, which learns to adapt a deep neural network to few-shot learning tasks. Bansal et al. [5] proposed a meta-learning model that combines supervised learning and self-supervised learning for natural language classification tasks.

On the other hand, metric learning focuses on finding an existing or learned metric space in which the support set can be simply matched with the query set. In metric learning, no additional parameters need to be learned once the metric is acquired. Snell et al. [38] proposed ProtoNet, which learns a metric space where classification can be performed by computing distances to prototype representations of each class. RelationNet [40] applies a neural network to learn the relation (similarity) between support and query images. Based on a memory module, MatchingNet [42] uses cosine distance as a metric to classify query samples. To achieve better classification, CAN [19] is designed to highlight the proper region of interest by learning an attention module. Zhang et al. [48] proposed the DeepEMD algorithm, which adopts the Earth Mover's Distance (EMD) as the metric to calculate similarity. Different from these approaches, we propose a new two-stage knowledge transfer framework, further improving the model's generalization ability for FSL tasks.

Knowledge distillation (KD)

With the goal of deploying cumbersome deep models on devices with limited resources, KD was developed for model compression and acceleration. KD works by effectively transferring knowledge from a large, complex model, called the teacher model, to a smaller, simpler one, called the student model. During the KD process, the student's learning performance is influenced by multiple factors, such as the knowledge types, distillation strategies and teacher-student architectures.

Hinton et al. [18] first generalized and brought this idea into deep learning. By minimizing the loss between the teacher model's and student model's outputs, informative knowledge can be transferred. The works in [10, 11] show KD's benefits for knowledge transfer and optimization. Besides, sequential distillation [9] is introduced to improve the performance of teacher models. Mobahi et al. [29] presented a theoretical analysis of self-distillation. Lim et al. [24] proposed Efficient-PrototypicalNet, which involves both transfer learning and knowledge distillation for few-shot learning tasks. Liu et al. [27] proposed learning a model through online self-distillation, which combines supervised training with knowledge distillation via a continuously updated teacher. A Supervised Masked Knowledge Distillation model (SMKD) [25] is also designed for few-shot Transformers, which incorporates label information into self-distillation frameworks. Unlike commonly-used KD, which only transfers knowledge from the teacher model, we use a dual selective knowledge distillation method to encourage the student model to selectively learn knowledge from the teacher model and the examples simultaneously.

3 Dual selective knowledge transfer

To address the challenges posed by FSL, we propose a new dual selective knowledge transfer (DSKT) framework, which consists of two stages: multi-task learning and dual selective knowledge distillation (DSKD). In the first stage, we introduce multi-task learning and enforce the learnt representations to be equivariant to image transformations, which is beneficial for extracting low-level features. Furthermore, we propose an effective DSKD, which enables the student model to selectively learn knowledge from both the teacher model and the current samples.

3.1 Problem formulation

In this work, we use a large-scale labelled base dataset for training a feature extractor. The base dataset is defined as \({{D}^{base}}=\left\{ x_{t}^{base},y_{t}^{base} \right\} _{t=1}^{{{N}^{base}}}\), with label \(y_{t}^{base}\in {{C}^{base}}\). In order to learn a good feature extractor, we assume that both the number of classes \(\left( \left|{C}^{base} \right|\right) \) and the number of examples \(\left( {N}^{base} \right) \) are large. We denote the novel dataset used for evaluation as \({{D}^{novel}}=\left\{ x_{t}^{novel},y_{t}^{novel} \right\} _{t=1}^{{{N}^{novel}}}\), with label \(y_{t}^{novel}\in {{C}^{novel}}\). Notice that the base classes and novel classes are disjoint, i.e., \({{C}^{base}}\cap {{C}^{novel}}=\varnothing \). The testing of the learnt model is organized in episodes, where each episode contains a support set \(D_{i}^{support}=\left\{ x_{i,t}^{support},y_{i,t}^{support} \right\} _{t=1}^{CK}\) and a query set \(D_{i}^{query}=\left\{ x_{i,t}^{query},y_{i,t}^{query} \right\} _{t=1}^{CK'}\), containing C classes with K and K' examples per class, respectively, drawn from the novel dataset \({{D}^{novel}}\).
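
To make the episodic evaluation protocol concrete, the following is a minimal sketch of how one C-way K-shot episode could be sampled from \({{D}^{novel}}\); the helper name sample_episode and its class_to_indices argument (a mapping from each novel class to its sample indices) are hypothetical and only illustrate the sampling described above.

```python
import random

def sample_episode(class_to_indices, n_way=5, k_shot=1, k_query=15):
    """Sample one C-way K-shot episode: a support set and a query set.

    class_to_indices: dict mapping each novel-class label to a list of sample indices.
    Returns (support, query), each a list of (sample_index, episode_label) pairs.
    """
    classes = random.sample(list(class_to_indices.keys()), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        chosen = random.sample(class_to_indices[c], k_shot + k_query)
        support += [(i, episode_label) for i in chosen[:k_shot]]
        query += [(i, episode_label) for i in chosen[k_shot:]]
    return support, query
```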

3.2 Multi-task Learning

In the first stage of our DSKT framework, we introduce a multi-task learning method which consists of a category task and a rotation task to train a discriminative feature extractor with good representations on the base dataset. Following a training strategy similar to [41], we adopt a neural network that contains a feature extractor \({{f}_{\phi }}\), a category classifier \({{g}_{\theta }}\) and an additional rotation classifier \({{g}_{\varphi }}\).

First, we randomly sample a minibatch \(B=\left\{ {{x}_{i}},{{y}_{i}} \right\} _{i=1}^{m}\) from the base dataset \({{D}^{\text {base}}}\), where m stands for the batch size. We apply rotation transformations of 90, 180 and 270 degrees to the images x to generate augmented copies \({{x}^{90}}\), \({{x}^{180}}\) and \({{x}^{270}}\), respectively. Then, we stack the original images and their transformed versions together, resulting in a single tensor \(\overset{\wedge }{\mathop {x}}\,=\left\{ x,{{x}^{90}},{{x}^{180}},{{x}^{270}} \right\} \in {{\mathbb {R}}^{\left( h\times w \right) \times \left( 4\times m \right) }}\) with corresponding labels \(\overset{\wedge }{\mathop {y}}\,\in {{\mathbb {R}}^{4\times m}}\). According to the rotation direction, we also create one-hot encoded labels \(\overset{\wedge }{\mathop {r}}\,={{\left\{ {{r}_{i}}\in {{\mathbb {R}}^{s}} \right\} }_{4\times m}}\), where \(s=4\) indicates that four rotation directions are used in our method.
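
The stacking step above can be sketched in PyTorch as follows, assuming the usual (batch, channel, height, width) image layout rather than the flattened shape used in the notation; the function and variable names are illustrative.

```python
import torch

def stack_rotations(x, y):
    """Stack a minibatch with its 90-, 180- and 270-degree rotated copies.

    x: images of shape (m, C, H, W); y: category labels of shape (m,).
    Returns x_hat of shape (4m, C, H, W), repeated category labels y_hat,
    and rotation labels r_hat in {0, 1, 2, 3}, one per rotation direction.
    """
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]  # 0, 90, 180, 270 degrees
    x_hat = torch.cat(rotations, dim=0)
    y_hat = y.repeat(4)
    r_hat = torch.arange(4).repeat_interleave(x.size(0))  # matches the concatenation order
    return x_hat, y_hat, r_hat
```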

Next, by using the feature extractor \({{f}_{\phi }}\), the stacked tensor \(\overset{\wedge }{\mathop {x}}\,\) is mapped to feature vectors \(\overset{\wedge }{\mathop {v}}={{f}_{\phi }}\left( \overset{\wedge }{\mathop {x}}\, \right) \in {{\mathbb {R}}^{d\times \left( 4\times m \right) }}\), where d denotes the feature map size. Then, we pass the feature vectors \(\overset{\wedge }{\mathop {v}}\,\) through category classifier \({{g}_{\theta }}\) to obtain its corresponding logits \(\overset{\wedge }{\mathop {p}}={{g}_{\theta }}\left( \overset{\wedge }{\mathop {v}} \right) \in {{\mathbb {R}}^{c\times \left( 4\times m \right) }}\). Finally, we apply rotation classifier \({{g}_{\varphi }}\) to map the logits \(\overset{\wedge }{\mathop {p}}\,\) to the rotation logits \(\overset{\wedge }{\mathop {q}}={{g}_{\varphi }}\left( \overset{\wedge }{\mathop {p}}\, \right) \in {{\mathbb {R}}^{s\times \left( 4\times m \right) }}\).

Afterwards, to train all network modules (i.e., \({{f}_{\phi }}\), \({{g}_{\theta }}\) and \({{g}_{\varphi }})\), we use two loss functions to jointly optimize three modules. One is a commonly-used cross entropy loss [41] for classifying categories of samples, which is denoted by \({{\ell }_{CE}}\). The other is a binary cross entropy loss for rotation prediction, which serves as the rotation loss \({{\ell }_{BCE}}\). The optimization problem is formulated as:

$$\begin{aligned} \phi ,\theta ,\varphi =\arg \underset{\phi ,\theta ,\varphi }{\mathop {\min }}\,{{\mathbb {E}}_{(x,y)\sim {D}^{base}}}\Big [ {{\ell }_{CE}}\left( {{g}_{\theta }}\left( {{f}_{\phi }}\left( \overset{\wedge }{\mathop {x}}\, \right) \right) ,\overset{\wedge }{\mathop {y}}\, \right) +\alpha \,{{\ell }_{BCE}}\left( {{g}_{\varphi }}\left( {{g}_{\theta }}\left( {{f}_{\phi }}\left( \overset{\wedge }{\mathop {x}}\, \right) \right) \right) ,\overset{\wedge }{\mathop {r}}\, \right) \Big ]. \end{aligned}$$
(1)

where \(\alpha \) is a weighting coefficient to control the strength of rotation loss.
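
As an illustration, the objective in (1) could be evaluated on one minibatch as sketched below, assuming \({{f}_{\phi }}\), \({{g}_{\theta }}\) and \({{g}_{\varphi }}\) are PyTorch modules and the rotation labels are given as class indices (converted to one-hot targets for the BCE term); the default value of alpha is illustrative only, since \(\alpha \) is tuned on the validation set.

```python
import torch.nn.functional as F

def multitask_loss(f_phi, g_theta, g_varphi, x_hat, y_hat, r_hat, alpha=2.0):
    """Joint category + rotation loss of Eq. (1); alpha weights the rotation term."""
    v_hat = f_phi(x_hat)     # features, shape (4m, d)
    p_hat = g_theta(v_hat)   # category logits, shape (4m, c)
    q_hat = g_varphi(p_hat)  # rotation logits, shape (4m, 4)
    ce = F.cross_entropy(p_hat, y_hat)
    r_onehot = F.one_hot(r_hat, num_classes=4).float()
    bce = F.binary_cross_entropy_with_logits(q_hat, r_onehot)
    return ce + alpha * bce
```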

Learning a good feature extractor is challenging in FSL as limited labelled samples are available during this process. In this work, we introduce multi-task learning, which enables the feature extractor to effectively learn low-level features [3]. In addition, the risk of overfitting can be reduced through the multi-task loss [3]. Experimental results in Table 4 demonstrate the superiority of our method.

3.3 Dual selective knowledge distillation

To improve the model's generalization ability through knowledge transfer, we develop an effective dual selective knowledge distillation (DSKD) method that promotes the learning of informative knowledge. Once the first stage is finished, we take one clone of the trained model as a teacher model, whose weights are frozen and only used for inference. The knowledge distillation (KD) [18] loss is reformulated into two components; by separately adjusting the strength of each component, informative knowledge can be selectively transferred to a new student model, which contains only a feature extractor and a classifier. In addition, PolyLoss [22] is introduced to ensure the student model selectively learns knowledge from the current training samples. By combining these dual learning strategies, our DSKD enables the student model to simultaneously learn knowledge from the teacher model and the current samples.

To make the different components of the transferred knowledge controllable, we first reformulate the KD loss into a target class knowledge distillation (TCKD) loss and a non-target class knowledge distillation (NCKD) loss, inspired by [50]. For the i-th input image, we use \({{p}_{i}}\in {{\mathbb {R}}^{c}}\) to denote its output logits, and \({p}_{i,t}\) represents the logit of the t-th class. Hence, the probability that the i-th sample belongs to the t-th class, \({{z}_{i,t}}\), and the probability that it belongs to any other class, \({{z}_{i,\backslash t}}\), can be formulated as:

$$\begin{aligned} {{z}_{i,t}}=\frac{{{e}^{{{p}_{i,t}}}}}{\sum \nolimits _{j}{{{e}^{{{p}_{i,j}}}}}},\qquad {{z}_{i,\backslash t}}=\frac{\sum \nolimits _{k=1,k\ne t}^{c}{{{e}^{{{p}_{i,k}}}}}}{\sum \nolimits _{j}{{{e}^{{{p}_{i,j}}}}}}. \end{aligned}$$
(2)

Meanwhile, another vector \({{\overset{\wedge }{\mathop {z}}\,}_{i}}\) denotes the probabilities over the non-target classes (i.e., the t-th class is not considered),

$$\begin{aligned} {{\overset{\wedge }{\mathop {z}}\,}_{i}}=\left[ {{\overset{\wedge }{\mathop {z}}\,}_{i,1}},\cdots ,{{\overset{\wedge }{\mathop {z}}\,}_{i,t-1}},{{\overset{\wedge }{\mathop {z}}\,}_{i,t+1}},\cdots ,{{\overset{\wedge }{\mathop {z}}\,}_{i,c}} \right] ,\quad s.t.\ {{\overset{\wedge }{\mathop {z}}\,}_{i,k}}=\frac{{{e}^{{{p}_{i,k}}}}}{\sum \nolimits _{j=1,j\ne t}^{c}{{{e}^{{{p}_{i,j}}}}}}. \end{aligned}$$
(3)

Thereby, TCKD loss and NCKD loss can be defined as:

$$\begin{aligned} {{\ell }_{TCKD}}=\text {KL}\left( b_{i}^{T}\,||\,b_{i}^{S} \right) ,\quad {{\ell }_{NCKD}}=\text {KL}\left( {{\overset{\wedge }{\mathop {z}}\,}_{i}^{T}}\,||\,{{\overset{\wedge }{\mathop {z}}\,}_{i}^{S}} \right) ,\quad s.t.\ {{b}_{i}}={{\left[ {{z}_{i,t}},\ {{z}_{i,\backslash t}} \right] }^{T}}\in {{\mathbb {R}}^{2\times 1}}, \end{aligned}$$
(4)

where KL denotes the Kullback-Leibler divergence, and T and S represent the teacher and student models, respectively. By adjusting the strengths of \({\ell }_{TCKD}\) and \({\ell }_{NCKD}\), the student model can selectively learn knowledge from well-predicted samples.
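
A sketch of how the decoupled terms in (2)-(4) could be computed from student and teacher logits is given below; the temperature scaling is our assumption here (based on the temperature coefficient reported in Section 4.1), and the function is illustrative rather than an exact reproduction of the implementation.

```python
import torch
import torch.nn.functional as F

def dkd_losses(p_s, p_t, target, temperature=4.0):
    """Return the (TCKD, NCKD) terms from student logits p_s and teacher logits p_t of shape (m, c)."""
    m, c = p_s.shape
    mask = F.one_hot(target, num_classes=c).bool()  # marks the target class of each sample

    def binary_probs(logits):
        # [z_t, z_\t] of Eq. (2): target-class probability vs. the sum over all other classes
        probs = F.softmax(logits / temperature, dim=1)
        z_t = probs.masked_select(mask).unsqueeze(1)
        return torch.cat([z_t, 1.0 - z_t], dim=1)

    def nontarget_probs(logits):
        # softmax over the c-1 non-target classes only, Eq. (3)
        nt = logits[~mask].view(m, c - 1)
        return F.softmax(nt / temperature, dim=1)

    # KL(teacher || student): kl_div takes the student log-probabilities as its first argument
    tckd = F.kl_div(binary_probs(p_s).log(), binary_probs(p_t), reduction='batchmean')
    nckd = F.kl_div(nontarget_probs(p_s).log(), nontarget_probs(p_t), reduction='batchmean')
    return tckd, nckd
```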

Apart from learning from the teacher model, we further explore adjustable learning from the current samples by introducing PolyLoss [22]. PolyLoss provides a unified view of common classification loss functions: it defines a loss as a linear combination of polynomial functions, under which focal loss is a horizontal shift of the polynomial coefficients relative to cross entropy loss. In this paper, PolyLoss is used to calculate the loss between predictions and ground-truth labels, guiding the student model's training toward good solutions. PolyLoss is defined as follows:

$$\begin{aligned} {{\ell }_{Poly-N}}=-\log \left( {{P}_{t}} \right) +\sum \limits _{k=1}^{N}{{{\beta }_{k}}{{\left( 1-{{P}_{t}} \right) }^{k}}}, \end{aligned}$$
(5)

where \({{P}_{t}}\) denotes the model's prediction probability for the target ground-truth class and N represents the number of leading polynomial coefficients. With this flexible form, PolyLoss can easily adjust the importance of different polynomial bases according to the target datasets and tasks, consistently improving performance. In this work, we set \(N=1\), which achieves significant improvement.
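
A sketch of the Poly-1 case of (5), computed on top of the standard softmax cross entropy, is given below; the default value of beta1 is illustrative, since \({{\beta }_{1}}\) is tuned on the validation set.

```python
import torch.nn.functional as F

def poly1_loss(logits, target, beta1=1.0):
    """Poly-1 loss of Eq. (5) with N=1: cross entropy plus beta1 * (1 - P_t)."""
    ce = F.cross_entropy(logits, target, reduction='none')  # -log(P_t), per sample
    p_t = F.softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)  # P_t
    return (ce + beta1 * (1.0 - p_t)).mean()
```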

Finally, the optimization problem of DSKD can be written as,

$$\begin{aligned} \begin{aligned}&{{\phi }^{'}},{{\theta }^{'}}=\arg \underset{{{\phi }^{'}},{{\theta }^{'}}}{\mathop {\min }}\,{{\mathbb {E}}_{(x,y)\sim {D}^{base}}}\\&\left[ {{\eta }_{1}}{{\ell }_{Poly-1}}\left( {{g}_{{{\theta }^{'}}}}\left( {{f}_{{{\phi }^{'}}}}\left( x \right) \right) ,y \right) \right. \\&+{{\eta }_{2}}\left[ {{\lambda }_{1}}{{\ell }_{TCKD}}\left( {{g}_{{{\theta }^{'}}}}\left( {{f}_{{{\phi }^{'}}}}\left( x \right) \right) ,{{g}_{\theta }}\left( {{f}_{\phi }}\left( x \right) \right) ,y \right) \right. \\&\left. \left. +{{\lambda }_{2}}{{\ell }_{NCKD}}\left( {{g}_{{{\theta }^{'}}}}\left( {{f}_{{{\phi }^{'}}}}\left( x \right) \right) ,{{g}_{\theta }}\left( {{f}_{\phi }}\left( x \right) \right) ,y \right) \right] \right] \\ \end{aligned} \end{aligned}$$
(6)

where \({{\eta }_{2}}=1-{{\eta }_{1}}\), \({{\lambda }_{1}}\) and \({{\lambda }_{2}}\) are the weights of \({\ell }_{TCKD}\) and \({\ell }_{NCKD}\), respectively. Our proposed DSKD method enables the student model to selectively learn knowledge from the teacher model and current samples.
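
Putting the pieces together, one evaluation of the DSKD objective in (6) could be sketched as follows, reusing the dkd_losses and poly1_loss sketches above; the weight values shown are illustrative defaults drawn from the ranges discussed in Section 4, not the tuned settings.

```python
import torch

def dskd_loss(student, teacher, x, y, eta1=0.5, lambda1=1.0, lambda2=2.0):
    """One evaluation of the DSKD objective in Eq. (6) on a minibatch (x, y)."""
    p_s = student(x)          # student logits g_theta'(f_phi'(x))
    with torch.no_grad():
        p_t = teacher(x)      # frozen teacher logits g_theta(f_phi(x))
    tckd, nckd = dkd_losses(p_s, p_t, y)
    poly = poly1_loss(p_s, y)
    return eta1 * poly + (1.0 - eta1) * (lambda1 * tckd + lambda2 * nckd)
```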

4 Experiments

In this section, we elaborate on the experimental configuration and evaluation. The description of the datasets and implementation details are first presented, followed by experimental results of our approach on popular benchmark datasets. Finally, an ablative analysis is presented.

4.1 Datasets and implementation details

Datasets

We conduct extensive experiments on four popular FSL benchmark datasets, i.e., miniImageNet [42], tieredImageNet [35], Fewshot-CIFAR100 (FC100) [31] and Caltech-UCSD Birds-200-2011 (CUB) [43]. miniImageNet, a subset of ImageNet, includes 100 classes with 600 images per class. We follow the splitting protocol proposed in [33], with 64 classes for training, 16 classes for validation and 20 classes for testing. tieredImageNet contains 608 classes, which can be grouped into 34 high-level categories. We use 20 categories (351 classes), 6 categories (97 classes), and 8 categories (160 classes) for training, validation and testing, respectively. FC100 is derived from the CIFAR100 dataset and employs a split similar to tieredImageNet, with 60 training classes, 20 validation classes and 20 testing classes, each class containing 100 images. CUB was originally used for fine-grained bird classification and has 11,788 images from 200 classes. We follow the split in [6], with 100, 50, and 50 classes for training, validation and testing, respectively.

Implementation details

To make a fair comparison with recent works [41], we use ResNet12 as our backbone. We follow [21] to apply DropBlock as a regularizer and change the number of filters from (64, 128, 256, 512) to (64, 160, 320, 640). Besides, a 4-neuron fully-connected layer (the rotation classifier) is applied after the final classification layer. Each batch contains 64 samples. We use SGD with a momentum of 0.9 and a weight decay of \(5{{e}^{-4}}\). The initial learning rate is set to 0.05 and decayed by a factor of 0.1. We train 100 epochs with two decays for miniImageNet and CUB, 60 epochs with three decays for tieredImageNet, and 65 epochs with one decay for FC100. The same learning schedule is applied during distillation. Besides, we use random cropping, color jittering and random horizontal flipping for data augmentation throughout the whole process. For the hyper-parameters, we set the temperature coefficient to 4.0 and \({{\eta }_{1}}={{\eta }_{2}}=0.5\), while \(\alpha ,\text { }{{\beta }_{1}},\text { }{{\lambda }_{1}}\) and \({{\lambda }_{2}}\) are tuned on the validation dataset. For evaluation, we train an N-way logistic regression classifier.
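
For illustration, the episodic evaluation with the frozen feature extractor and the N-way logistic regression classifier could be sketched as follows (scikit-learn is assumed as the logistic regression implementation; the helper name and episode tensors are hypothetical).

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def evaluate_episode(feature_extractor, support_x, support_y, query_x, query_y):
    """Fit an N-way logistic regression on frozen support features and score the query set."""
    feature_extractor.eval()
    f_s = feature_extractor(support_x).cpu().numpy()  # (C*K, d) support features
    f_q = feature_extractor(query_x).cpu().numpy()    # (C*K', d) query features
    clf = LogisticRegression(max_iter=1000)
    clf.fit(f_s, support_y.cpu().numpy())
    return clf.score(f_q, query_y.cpu().numpy())      # episode accuracy
```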

Table 1 Comparison of DSKT (our approach) to prior works on miniImageNet and tieredImageNet datasets, with mean accuracy (%) and 95% confidence interval
Table 2 Comparison of DSKT (our approach) to prior works on FC100 dataset, with mean accuracy (%) and 95% confidence interval
Table 3 Comparison of DSKT (our approach) to prior works on CUB dataset, with mean accuracy (%) and 95% confidence interval
Table 4 FSL results on CUB, with different combinations of loss functions

4.2 Evaluations

miniImageNet and tieredImageNet

Table 1 presents a comparison between our approach and the state-of-the-art methods for FSL tasks on the two ImageNet-based benchmarks. Our method is denoted as DSKT, where DSKT-1 and DSKT-2 represent the first and second stage, respectively. On both datasets and in both 1-shot and 5-shot scenarios, our approach yields state-of-the-art results. On miniImageNet, DSKT-1 achieves 65.88% and 83.09% on 5-way 1-shot and 5-way 5-shot tasks, respectively, a gain of 1.06% and 0.68% over RFS-distill. The improvement becomes more substantial after distillation, with DSKT-2 producing gains of 2.51% and 1.78% over RFS-distill under the 5-way 1-shot and 5-way 5-shot settings, respectively. On tieredImageNet, DSKT-1 improves over RFS-distill by 0.08% and 0.58% on the 1-shot and 5-shot tasks, respectively. The improvements are 0.59% and 0.66% with DSKT-2.

FC100

Table 2 presents similar comparisons, this time on FC100. Here, DSKT provides accuracy improvements in all cases. For DSKT-1, the improvements over RFS-distill for 1-shot and 5-shot are 0.8% and 1.8%, respectively. Furthermore, the addition of distillation (DSKT-2) brings a further improvement of 1.2% under the 5-way 1-shot and 1.0% under the 5-way 5-shot settings.

CUB

Table 3 compares our approach, DSKT, against the state-of-the-art on CUB for fine-grained classification. Here, our method outperforms previous works even when they are implemented with deeper backbones. In particular, DSKT-2 improves over the best-reported numbers by 2.67% and 2.07% in the 5-way 1-shot and 5-way 5-shot scenarios, respectively.

4.3 Ablative analysis

Benefits of multi-task learning

To study the impact of multi-task learning, we evaluate our approach with and without the rotation loss on CUB. As shown in Table 4, we first train DSKT-1 with only the cross entropy loss, which is similar to RFS-simple [41]; the model achieves 71.74% and 87.23% on 5-way 1-shot and 5-way 5-shot tasks, respectively. Then, we train DSKT-1 with the additional rotation loss, and the performance improves to 76.26% and 90.76%, an absolute gain of 4.52% and 3.53%. From these results, we can infer the importance of multi-task learning during the training process.

Benefits of DSKD

To better evaluate the contribution of DSKD, we train the model with different combinations of loss functions on CUB. From Table 4, we find that, for models trained only with the cross entropy loss in the first stage, DSKD gives 1.01% and 0.73% gains over KD (KD is defined as \({{\ell }_{KD}}=KL\left( p_{i}^{T}\,||\,p_{i}^{S}\right) \), where \(p_{i}\) denotes the output logits of the i-th sample, and T and S represent the teacher and student models, respectively) on 1-shot and 5-shot tasks, respectively. For models trained with both the cross entropy and binary cross entropy losses, DSKD still achieves 0.45% and 0.39% improvements over KD. These results confirm the effectiveness of DSKD.

Table 5 FSL results on CUB, with different processing methods in the first stage
Table 6 FSL results on CUB, with different classifiers during the evaluation process

Benefits of rotation and logistic regression classifier

To study the impact of rotation and the logistic regression classifier, we evaluate our approach against other processing methods and classifiers on CUB. Table 5 presents the results with different processing methods in the first stage (DSKT-1). We find that rotation performs much better than fusion (the three channels of the RGB images are fused, with the coefficient of each channel set to 0.5 in this work). Compared to puzzle (the original image is split into pieces, randomly shuffled and then reassembled; the puzzle size is set to 7 on CUB 5-way tasks), rotation still obtains 0.5% and 1% improvements on CUB 5-way 1-shot and 5-shot tasks, respectively. Table 6 shows the FSL results with different classifiers on CUB during the evaluation process. The logistic regression classifier achieves the best performance compared to the Nearest classifier (predictions based on the Euclidean distances between query and support samples) and the Cosine classifier (predictions based on the cosine similarities between query and support samples) on both CUB 5-way 1-shot and 5-shot tasks.

Table 7 FSL results on CUB, with different inputs for the rotation classifier in the first stage
Table 8 Results of different degrees of polynomial in PolyLoss on CUB

Benefits of using the output of the category classifier as input for the rotation classifier

In this paper, we use the output of the category classifier as the input to the rotation classifier, rather than the output of the feature extractor. This helps improve the model's performance on complex data. Table 7 presents the comparison results on CUB. From Table 7, we find that applying the output of the category classifier gains 0.89% and 0.66% improvements on CUB 5-way 1-shot and 5-shot tasks, respectively.

Different degrees of polynomial in PolyLoss

The degree N of the polynomial in PolyLoss is set to 1 in this work. To study the impact of the degree, we assign it different values on CUB 5-way tasks. From Table 8, we find that with \(N=1\), our method achieves around 0.8% and 0.3% improvements over \(N=2\) or \(N=3\) on CUB 5-way 1-shot and 5-shot tasks, respectively.

Application of the rotation loss

In this work, the rotation loss is applied directly to the predicted logits of the category classifier. To study its contribution, we compare this choice with applying the rotation loss to the features on CUB 5-way tasks. Table 9 shows that, compared with applying the multi-task loss to the features, applying it to the category logits achieves around 0.5% and 0.9% improvements on CUB 5-way 1-shot and 5-shot tasks, respectively.

Fig. 2

Ablative Analysis

Table 9 Results of different applications of rotation loss on CUB

Variations of Hyper-parameters

There are seven hyper-parameters in this work: \(t,\text { }\alpha ,\text { }{{\beta }_{1}}\), \({{\lambda }_{1}},\text { }{{\lambda }_{2}},\text { }{{\eta }_{1}}\) and \({{\eta }_{2}}\). Here, t denotes the temperature coefficient, \(\alpha \) controls the contribution of the rotation loss during DSKT-1, \({{\beta }_{1}}\) adjusts the contribution of the first polynomial base, and \({{\lambda }_{1}}\), \({{\lambda }_{2}}\), \({{\eta }_{1}}\) and \({{\eta }_{2}}\) are the weights of the different loss functions. We mainly investigate variations of the hyper-parameters \({\alpha }\), \({{\beta }_{1}}\), \({{\lambda }_{1}}\) and \({{\lambda }_{2}}\), with the default values of \(t, {{\eta }_{1}}\) and \({{\eta }_{2}}\) set to 4.0, 0.5 and 0.5, respectively. Figure 2a shows the DSKT-1 performance on CUB 5-way 1-shot tasks as \(\alpha \) changes. The model performance increases as \(\alpha \) goes from 0.5 to 2 and then decreases at \(\alpha =5\), which indicates the importance of multi-task learning. Figure 2c shows the DSKT-2 performance on CUB 5-way 1-shot tasks as \({{\beta }_{1}}\) changes. We observe that DSKT-2 achieves over 78% as \({{\beta }_{1}}\) increases from 1 to 5, while \({{\beta }_{1}}=10\) performs the worst. Note that the performance drops by only about 0.6%, which indicates that the method is not sensitive to the value of \({{\beta }_{1}}\). Figure 2b presents the DSKT-2 performance on CUB 5-way 1-shot tasks as \({{\lambda }_{1}}\) changes. When \({{\lambda }_{1}}\) ranges from 0.5 to 5, DSKT-2 always obtains over 77% accuracy and achieves the highest accuracy at \({{\lambda }_{1}}=1.0\), indicating that the method is not sensitive to the value of \({{\lambda }_{1}}\). Figure 2d presents the DSKT-2 performance on the same tasks as \({{\lambda }_{2}}\) changes. We find that the performance increases from 1 to 2 and then decreases for larger values of \({{\lambda }_{2}}\). These results indicate that it is not a good idea to blindly increase the weight of \({\ell }_{NCKD}\).

5 Conclusion

In this work, we aim to raise awareness of the importance of training a well-generalized feature extractor for FSL tasks by proposing a new two-stage dual selective knowledge transfer framework. First, we use multi-task learning to enforce the feature extractor to learn robust low-level features. Then, we propose an effective dual selective knowledge distillation method, which enables the student model to selectively learn knowledge from the teacher model and the current examples, further improving the model's generalization ability. Extensive experimental results demonstrate the importance of strong feature extractors for FSL and show that our approach outperforms the state-of-the-art on four popular FSL benchmark datasets. Despite the promising results, some areas are worth further investigation. We only apply our method to classification tasks in this paper; object detection tasks will be a future research direction. The correlation between distillation performance and the polynomial coefficients \({\beta }_{i}\) is also not fully investigated; we will expand upon this in future research.