1 Introduction

In recent years, deep neural networks have achieved excellent performance on various challenging tasks, such as image classification [1, 2], semantic segmentation [3, 4], object detection [5, 6], text recognition [7, 8] and voice recognition [9, 10]. However, such powerful performance relies heavily on large-scale labeled samples, and preparing sufficient human-annotated datasets is often laborious and expensive in practice. To mitigate the performance degradation of models trained with limited samples, GAN-based methods [11, 12] generate new data through model learning to expand the original dataset. Domain adaptation methods [13,14,15], on the other hand, enhance the model’s transferability by aligning or adjusting the data distributions between the source and target domains. Few-shot learning (FSL), inspired by human learning, can easily absorb new concepts from only a few examples. FSL aims to recognize a set of novel classes with few labeled data by utilizing the knowledge learned from a set of base classes with abundant labeled samples [16].

Metric-based few-shot learning methods are a simple and effective solution [17,18,19] that can achieve promising performance. A metric-based method usually contains a feature encoder and a metric function. The feature encoder maps samples into feature vectors, and the metric function computes the similarity between the query set and the support set. The query sample can then be accurately categorized according to the nearest neighbor principle. To improve the accuracy of metric-based methods, existing approaches attempt to train a well-performing feature encoder that matches a fixed metric, such as MatchingNet [20], or to directly learn an appropriate metric function via a CNN model, such as RelationNet [18]. The attention mechanism can shift the focus of the model to the most important regions of an image and disregard irrelevant parts [21], which offers great advantages in the feature extraction stage. The primary challenge in few-shot learning is the scarcity of samples, so the effective utilization of limited sample features is of paramount concern. In image classification tasks, global features provide insights into the commonalities and overall differences between classes. Furthermore, by capturing local features that are relevant to specific objects or categories, the model becomes less sensitive to irrelevant information, which enhances the distinctiveness of images and facilitates the discrimination between different categories. Attention-based architectures have been widely applied to FSL, such as the cross attention network [22], the self-attention relation network [23] and adaptive attention [24]. However, only one or two kinds of attention are involved in these methods, which restricts the extraction of feature information to limited dimensions.
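To make the metric-based pipeline concrete, the following minimal sketch (our own illustration under simple assumptions, not the method proposed in this paper; `encoder` stands for an arbitrary feature extractor) classifies each query sample by its nearest class prototype under cosine similarity:

```python
import torch
import torch.nn.functional as F

def metric_classify(encoder, support_x, support_y, query_x, n_way):
    """Nearest-prototype classification with cosine similarity (illustrative sketch).

    support_x: (N*K, 3, H, W) labeled support images
    support_y: (N*K,) class indices in [0, n_way)
    query_x:   (Q, 3, H, W) unlabeled query images
    """
    s_feat = encoder(support_x)                 # (N*K, C) embeddings
    q_feat = encoder(query_x)                   # (Q, C) embeddings
    # Average the support embeddings of each class to form prototypes.
    prototypes = torch.stack(
        [s_feat[support_y == c].mean(dim=0) for c in range(n_way)])   # (N, C)
    # Cosine similarity between every query embedding and every prototype.
    logits = F.cosine_similarity(
        q_feat.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)         # (Q, N)
    return logits.argmax(dim=-1)                # predicted class per query
```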

In this paper, we propose PANet, a two-phase attention-oriented method, to address this challenge. As illustrated in Fig. 1, it mainly consists of three modules: the feature embedding module, the local encoded intra-attention (LEIA) module and the global encoded reciprocal attention (GERA) module. We combine the encoded regional spatial attention [25] and the adjacent channel attention in a sequential way to form the LEIA module, which aims to capture comprehensive local feature dependencies within every single sample and prepare informative inputs for the GERA module. The GERA module incorporates global attention maps of both the support and query images to locate the most relevant regions in support-query image pairs, thus enhancing the discriminability of the representations obtained from the LEIA module. The two modules complement each other from two different aspects and ensure that the feature information both within and between images can be fully utilized. In addition, we propose a dual-centralization (DC) cosine similarity metric, which adopts centering operations during the cosine similarity calculation. DC eliminates the disparity of data distributions across different dimensions and enhances the accuracy and reliability of the metric. Our model achieves new state-of-the-art performance. We also conduct a series of ablation experiments, and the results demonstrate that each key component in our proposed model plays a critical role. Our contributions can be summarized as follows:

  • We focus on adjacent channel attention and design a novel local encoded intra-attention module, which can learn local feature dependencies within each sample along both the channel and spatial dimensions.

  • We propose a novel global encoded reciprocal attention module for few-shot image classification, which captures discriminative global features by interactively learning spatial and channel attention between the support and query samples.

  • We design a dual-centralization (DC) cosine similarity metric, which performs two centering operations on different dimensions of the feature vector to alleviate the disparity in data distribution and improve the reliability of the cosine similarity.

Fig. 1 Overall framework of our method. PANet contains three modules: the feature embedding module, the local encoded intra-attention (LEIA) module and the global encoded reciprocal attention (GERA) module. The feature embedding module generates the feature maps for the input images. The LEIA module refines regional spatial and adjacent channel attention upon the feature maps. The GERA module computes the encoded reciprocal spatial and channel attention between the support and query representations

2 Related Work

2.1 Few-Shot Learning

Few-shot learning (FSL) aims to learn a classifier from a set of base classes with abundant labeled samples and then adapt it to a set of novel classes with few labeled data [26]. Existing studies on FSL can be mainly divided into two categories. Metric-based methods aim to learn a good feature encoder and an appropriate metric function. The feature encoder provides reliable feature maps for the metric function to categorize novel samples according to the nearest neighbor principle. For example, ProtoNet [17] utilizes a CNN to extract features and minimizes the Euclidean distance between the query samples and the category prototypes of the support samples to make the correct classification. DeepEMD [27] also uses a CNN as the feature encoder and employs the earth mover's distance to compare discriminative representations composed of local features for image classification tasks. In the work of [28], a cycle optimization metric network is proposed to explore the commutative relationship between support and query samples, achieving performance improvements through a forward-backward network framework guided by a cycle-consistency loss. Optimization-based methods can quickly adapt to novel tasks in a few optimization steps. For example, MAML [29] seeks a good initialization that can be quickly adapted to new tasks with fine-tuning. MetaOptNet [30] formulates the use of linear classifiers as convex learning problems and exploits gradient-based techniques to obtain parameters of the feature encoder that generalize well across tasks.

2.2 Few-Shot Learning on Image Classification

Few-shot image classification is a common and important task in few-shot learning, which primarily focuses on the challenge of classifying new categories with only a limited number of samples. Tseng et al. [31] design a feature transformation layer that enhances image features through affine transformations to simulate various feature distributions in different domains during the training phase. This allows the model to capture variations in feature distributions across different domains, resulting in strong generalization capabilities for cross-domain few-shot image classification tasks. Li et al. [32] employ a weight imprinting strategy and design a knowledge transfer mechanism to extensively exploit prior knowledge for inferring information about new categories. They also use a fine-tuning strategy during the training phase to simulate various feature distributions in different domains, reducing the impact of changes in the data domain on the model. Cao et al. [33] propose COMET, which learns feature embeddings of concepts through an independent concept learner and compares them with concept prototypes. COMET effectively summarizes information about concept dimensions, assigns a concept importance score to each dimension, and enhances the accuracy of fine-grained few-shot image classification. Huang et al. [34] design a novel Low-Rank Pairwise Alignment Bilinear Network (LRPABN) to capture subtle differences between support samples and query samples in fine-grained images to learn an effective distance metric.

2.3 Attention Mechanism

The attention mechanism is widely used in computer vision tasks. Intra-attention pays more attention to the internal connections of the input data. The queries, keys, and values in intra-attention come from the same data source and are obtained from the same matrix through different linear transformations. Shaw et al. [35] extend the intra-attention mechanism to efficiently consider representations of the relative positions between sequence elements. Yang et al. [36] model localness for intra-attention networks to enhance the ability to capture useful local information. The authors of [37] design a simple intra-attention layer that performs global attention across all pixels, which also reduces the number of training parameters while maintaining network performance. The reciprocal attention mechanism was first proposed for the video question answering task [38] and has since been widely used in various computer vision tasks. For example, Lu et al. [39] utilize co-attention in video object segmentation between frames sampled from a video sequence. In few-shot learning, reciprocal attention focuses on the features of both the query and support images at the same time, especially the most relevant regions in support-query image pairs. Hsieh et al. [40] address the one-shot object detection task based on the co-attention mechanism. Hou et al. [22] develop a meta-learner to compute the cross attention between support and query feature maps. Inspired by these works, we design a brand new adjacent channel attention that extracts the adjacent channel group of each central channel to generate attention maps. We also embed relative positional encoding into the adjacent channel attention to boost feature representation by exploiting positional information. Then we combine the regional spatial and adjacent channel attention together to form the local encoded intra-attention module, which can capture more comprehensive and meaningful local feature information along two principal dimensions.

3 Methodology

3.1 Problem Definition

FSL aims to classify new categories using only a few labeled samples. To alleviate the overfitting problem caused by such a small number of samples, we adopt the meta-learning framework with an episodic training strategy. We denote the training set and the testing set as image-label pairs, \({D_{train}=\{X_{tra},Y_{tra}}\}\) and \(D_{test}=\{X_{tst},Y_{tst}\}\), respectively, with \(Y_{tra} \cap Y_{tst} = \emptyset \). Both \(D_{train}\) and \(D_{test}\) consist of multiple episodes. The episodes used in training simulate the test process, which maintains consistency between the training and testing stages. Each episode, also called an N-way K-shot task, is formed by randomly extracting N classes and K labeled images per class as the support set \(S=\left\{ x_{s}^{i}, y_{s}^{i}\right\} _{i=1}^{N\times K}\). Then r samples per class are randomly selected from the remaining images of the N classes as the query set \(Q=\left\{ x_{q}^{j}, y_{q}^{j}\right\} _{j=1}^{N\times r}\). We iteratively sample N-way K-shot episodes from \(D_{train}\) to learn a good model. A model learned in this way can be trained with a small number of support samples to achieve rapid identification of the query samples of new tasks.
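As an illustration of the episodic sampling described above, the sketch below (variable names such as `dataset` are our own assumptions, not the authors' code) draws one N-way K-shot episode with r query samples per class:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, r_query):
    """Sample one N-way K-shot episode (support + query) from a labeled dataset.

    dataset: list of (image, label) pairs; names are illustrative only.
    Returns support and query lists of (image, label) pairs with labels
    re-indexed to [0, n_way).
    """
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)

    classes = random.sample(list(by_class.keys()), n_way)
    support, query = [], []
    for new_lbl, cls in enumerate(classes):
        imgs = random.sample(by_class[cls], k_shot + r_query)
        support += [(img, new_lbl) for img in imgs[:k_shot]]
        query += [(img, new_lbl) for img in imgs[k_shot:]]
    return support, query
```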

3.2 Overall Framework

As shown in Fig. 1, PANet mainly consists of three components: the feature embedding module, the local encoded intra-attention (LEIA) module and the global encoded reciprocal attention (GERA) module. Given a pair of support and query images as input, the feature encoder generates the base feature maps \(F_s\) and \(F_q\in \mathbb {R} ^{C\times H\times W}\), where C denotes the channel size and \(H\times W\) denotes the spatial size of each feature map. The LEIA module then sequentially focuses on important regions of the base feature maps along the spatial and channel dimensions to learn the regional spatial attention representations \(P_s\), \(P_q\in \mathbb {R} ^{C\times H\times W}\) and the adjacent channel attention representations \(D_s\), \(D_q\in \mathbb {R} ^{C\times H\times W}\). Following the LEIA module, the GERA module highlights the relevant regions and computes the reciprocal spatial and channel attention maps \(A_{s}^{sp}\) and \(A_{q}^{sp}\), \(A_{s}^{ch}\) and \(A_{q}^{ch}\) between the pair of image representations, where the superscript sp denotes the spatial dimension and ch denotes the channel dimension. The attended features are then aggregated to produce the global attention representations \(G_s\) and \(G_q\). The above process is applied to all support images in parallel, and the query sample is categorized as the class of its nearest support embedding according to the dual-centralization cosine similarity.
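The composition of the three modules can be summarized by the following schematic sketch; `backbone`, `leia` and `gera` are placeholders for the modules detailed in the next subsections, and the code is illustrative rather than the authors' implementation:

```python
import torch.nn as nn

class PANetForward(nn.Module):
    """Schematic composition of the three modules described above.

    `backbone`, `leia`, and `gera` stand in for the feature embedding,
    LEIA, and GERA modules; their internals are sketched in later sections.
    """
    def __init__(self, backbone, leia, gera):
        super().__init__()
        self.backbone, self.leia, self.gera = backbone, leia, gera

    def forward(self, x_support, x_query):
        F_s = self.backbone(x_support)   # base feature maps F_s, F_q
        F_q = self.backbone(x_query)
        D_s = self.leia(F_s)             # local encoded intra-attention features
        D_q = self.leia(F_q)
        G_s, G_q = self.gera(D_s, D_q)   # global encoded reciprocal attention
        return G_s, G_q
```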

3.3 Local Encoded Intra-Attention (LEIA) Module

The LEIA module is designed to capture comprehensive local features of an image, involving both the encoded regional spatial attention and the adjacent channel attention. In our LEIA module, \(F_s\) and \(F_q\) sequentially pass through the regional spatial attention and adjacent channel attention modules to obtain the local encoded intra-attention representations \(D_s\) and \(D_q\). \(D_s\) and \(D_q\) contain both regional spatial information and adjacent channel information, improving the quality of the feature representation. Figures 2 and 3 describe the two main components of the LEIA module.

Fig. 2 Illustration of regional spatial attention with spatial extent k

3.3.1 Regional Spatial Attention

Following the structure of stand-alone self-attention [37], our module can also capture regional spatial information in the feature representation. Figure 2 illustrates the detailed operation of regional spatial attention. We first apply an unfold operation to extract the local region \(L\in \mathbb {R} ^{C\times k\times k}\) with spatial extent k centered around every pixel \(f\in \mathbb {R} ^{C\times 1\times 1}\) in the feature maps \(F\in \mathbb {R} ^{C\times H\times W}\). Then we encode the position information into every local region and calculate the regional spatial attention exerted on every single pixel. Finally, we obtain the pixel \(p\in \mathbb {R} ^{C\times 1\times 1}\) with regional spatial attention, and all such pixels together form the representation \(P\in \mathbb {R} ^{C\times H\times W}\).

Given a pixel \(f_{(i,j)}\in \mathbb {R} ^{C\times 1\times 1}\) and its local region of pixels in positions \((x,y)\in L_k\left( i,j \right) \), \(p_{(i,j)}\) can be computed as:

$$\begin{aligned} p_{(i,j)}=\sum _{x, y \in {L}_{k}(i, j)}softmax\left[ q_{(i,j)}^{\top } \left( k_{(x,y)}+e_{(x-i, y-j)}\right) \right] v_{(x,y)}+f_{(i,j)}, \end{aligned}$$
(1)

where the query \(q_{(i,j)}=W_{Q} f_{(i,j)}\), keys \(k_{(x,y)}=W_{K} f_{(x,y)}\), and values \(v_{(x,y)}=W_{V} f_{(x,y)}\) are linear transformations of the center pixel \(f_{(i,j)}\) and its neighbor pixels \(f_{(x,y)}\) in the local region L. \(W_{Q}\), \(W_{K}\) and \(W_{V}\) are all independent linear transformation layers. A row offset embedding \(e_{x-i}\in \mathbb {R} ^{\frac{C}{2}\times H\times 1}\) and a column offset embedding \(e_{y-j}\in \mathbb {R} ^{\frac{C}{2}\times 1\times W}\) are concatenated to form the relative positional embedding \(e_{(x-i, y-j)}\), which is used to encode position information for \(f_{(x,y)}\). The computation in Eq. 1 is applied to every pixel \(f_{(i, j)}\), and the outputs \(p_{(i,j)}\) are assembled together to generate the regional spatial attention representation P.
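A possible PyTorch realization of Eq. 1 is sketched below. It uses `F.unfold` to gather the \(k\times k\) neighborhood of every pixel; the handling of the row/column offset embeddings within the local window and all layer names are our own assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionalSpatialAttention(nn.Module):
    """Sketch of Eq. (1): local self-attention over a k x k neighborhood.

    Assumes an even number of channels; relative positional embeddings are
    split into row/column halves as in the text, but indexed within the
    local window (an assumption of this sketch).
    """
    def __init__(self, channels, k=5):
        super().__init__()
        self.k = k
        self.to_q = nn.Conv2d(channels, channels, 1, bias=False)
        self.to_k = nn.Conv2d(channels, channels, 1, bias=False)
        self.to_v = nn.Conv2d(channels, channels, 1, bias=False)
        # Row and column relative offset embeddings (C/2 channels each).
        self.rel_row = nn.Parameter(torch.randn(channels // 2, k, 1))
        self.rel_col = nn.Parameter(torch.randn(channels // 2, 1, k))

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        k, pad = self.k, self.k // 2
        q = self.to_q(x)                               # (B, C, H, W)
        # Unfold keys/values into k*k local regions around every pixel.
        kx = F.unfold(self.to_k(x), k, padding=pad).view(B, C, k * k, H * W)
        vx = F.unfold(self.to_v(x), k, padding=pad).view(B, C, k * k, H * W)
        # Relative positional embedding e_(x-i, y-j): concat row/col halves.
        rel = torch.cat([self.rel_row.expand(-1, k, k),
                         self.rel_col.expand(-1, k, k)], dim=0)   # (C, k, k)
        kx = kx + rel.reshape(1, C, k * k, 1)
        # q^T (k + e): dot product over channels, softmax over the k*k region.
        attn = (q.view(B, C, 1, H * W) * kx).sum(dim=1).softmax(dim=1)  # (B, k*k, H*W)
        out = (attn.unsqueeze(1) * vx).sum(dim=2).view(B, C, H, W)
        return out + x                                 # residual term f_(i,j) in Eq. (1)
```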

Fig. 3 Illustration of adjacent channel attention with channel extent k

3.3.2 Adjacent Channel Attention

We propose a novel channel attention module to learn the latent relationships between the channels of feature maps. Given the regional spatial attention representation P as input, we use a pooling operation to aggregate spatial information and apply an unfold operation to extract the adjacent group \(J\in \mathbb {R} ^{k\times 1\times 1}\) with extent k centered around every channel. Then we compute the adjacent channel attention exerted on every single channel and adopt aggregation to acquire \(d\in \mathbb {R} ^{1\times 1\times 1}\) with adjacent channel information. Finally, all the learned channels are combined together to form the adjacent channel attention representation \(D\in \mathbb {R} ^{C\times H\times W}\). Figure 3 shows the detailed process of our adjacent channel attention. Given a channel \(p_i\) and its neighbor channels \(p_{x}\) in the adjacent group J, where \(x\in J_{k}\left( i\right) \), the output \(d_{i}\) is computed as follows:

$$\begin{aligned} d_{i}=\sigma \left( \sum _{x \in J_{k}(i)} softmax\left[ q_{i}^{\top } \left( k_{x}+e_{x-i}\right) \right] v_{x}\right) p_i+p_i, \end{aligned}$$
(2)

where \(\sigma \) is a Sigmoid function. We use offset \(e_{x-i}\in \mathbb {R} ^{k\times 1\times 1}\) to represent the relative channel embedding which can be utilized to encode position information for \(p_x\). The query \(q_i\), keys \(k_x\) and values \(v_x\) in Eq. 2 are denoted below:

$$\begin{aligned} q_i= & {} avgpool(W_{Q}(p_{i}))+maxpool(W_{Q}(p_{i})), \end{aligned}$$
(3)
$$\begin{aligned} k_x= & {} avgpool(W_{K}(p_{x}))+maxpool(W_{K}(p_{x})), \end{aligned}$$
(4)
$$\begin{aligned} v_x= & {} avgpool(W_{V}(p_{x}))+maxpool(W_{V}(p_{x})). \end{aligned}$$
(5)

Here, \(W_Q\), \(W_K\) and \(W_V\) indicate independent linear transformations. To compute the channel attention efficiently, we apply both average-pooling and max-pooling to aggregate spatial information.
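The following sketch illustrates Eqs. 2-5 in PyTorch. Pooling is applied after the \(1\times 1\) projections, each channel attends to its k adjacent channels, and the result gates the input with a residual connection; the zero padding at the channel borders and all parameter names are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentChannelAttention(nn.Module):
    """Sketch of Eqs. (2)-(5): attention over a group of k adjacent channels."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.k = k
        self.to_q = nn.Conv2d(channels, channels, 1, bias=False)
        self.to_k = nn.Conv2d(channels, channels, 1, bias=False)
        self.to_v = nn.Conv2d(channels, channels, 1, bias=False)
        # Relative channel offset embedding e_(x-i), one scalar per offset.
        self.rel = nn.Parameter(torch.randn(k))

    def _pool(self, proj, x):            # Eqs. (3)-(5): avg-pool + max-pool -> (B, C)
        y = proj(x)
        return (F.adaptive_avg_pool2d(y, 1) + F.adaptive_max_pool2d(y, 1)).flatten(1)

    def forward(self, p):                # p: (B, C, H, W)
        B, C, H, W = p.shape
        pad = self.k // 2
        q = self._pool(self.to_q, p)     # (B, C)
        kx = self._pool(self.to_k, p)    # (B, C)
        vx = self._pool(self.to_v, p)    # (B, C)
        # Gather k adjacent channels around every channel (zero padding at the ends).
        k_nb = F.pad(kx, (pad, pad)).unfold(1, self.k, 1)   # (B, C, k)
        v_nb = F.pad(vx, (pad, pad)).unfold(1, self.k, 1)   # (B, C, k)
        # Eq. (2): softmax over the adjacent group, then sigmoid gating.
        attn = ((q.unsqueeze(-1) * (k_nb + self.rel)).softmax(dim=-1) * v_nb).sum(-1)
        gate = torch.sigmoid(attn).view(B, C, 1, 1)
        return gate * p + p              # residual term p_i in Eq. (2)
```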

Fig. 4 Overall architecture of the proposed global encoded reciprocal attention (GERA) module. The GERA module encodes the position information and computes both the reciprocal spatial and channel attention between the pair of local attention representations

3.4 Global Encoded Reciprocal Attention (GERA) Module

We design a global encoded reciprocal attention (GERA) module to capture discriminative global features by interactively learning spatial and channel attention between the support and query samples. Figure 4 describes the structure of the GERA module. The GERA module first transforms the local representations \(D_{s}\) and \(D_{q}\) into the transformation representations \(T_{s}\) and \(T_{q}\). Then the module encodes the position information and produces the intermediate representations \(I_{s}\) and \(I_{q}\). After that, the module performs a cross operation between the support and query representations and obtains the reciprocal attention maps \(A_{s}^{sp}\), \(A_{q}^{sp}\), \(A_{s}^{ch}\) and \(A_{q}^{ch}\) in both the spatial and channel dimensions. Finally, the global attention representations \(G_s\) and \(G_q\) are acquired by multiplying the attention maps with the input representations of the GERA module in a sequential manner. The detailed computation is presented below.

The linear transformation for the local attention representations \(D_{s}\) and \(D_{q}\) can be represented as:

$$\begin{aligned} T_s=W_{s}D_{s} \ ,\ T_{q}=W_{q}D_{q}, \end{aligned}$$
(6)

where \(W_{s}\) and \(W_{q}\) represent independent linear transformations. \(T_{s}\) and \(T_{q}\in \mathbb {R} ^{C\times H\times W}\) denote the transformation representations. To better utilize the position information, we apply the relative position embeddings \(E_{s}\) and \(E_{q}\) to encode \(T_{s}\) and \(T_{q}\):

$$\begin{aligned} I_s=T_{s} \oplus E_{s} \ , \ I_q=T_{q} \oplus E_{q}, \end{aligned}$$
(7)

where \(\oplus \) denotes element-wise addition. \(I_s\) and \(I_q\in \mathbb {R} ^{C\times H\times W}\) are intermediate representations. Then the reciprocal attention maps are calculated in turn:

$$\begin{aligned} A_{s}^{sp}= & {} softmax\left( T_{q}^{\top }\otimes I_{s}\right) , \end{aligned}$$
(8)
$$\begin{aligned} A_{q}^{sp}= & {} softmax\left( T_{s}^{\top }\otimes I_{q}\right) . \end{aligned}$$
(9)

Here, \(\otimes \) denotes matrix multiplication, the superscript sp denotes the spatial dimension, and \(A_{s}^{sp}\) and \(A_{q}^{sp}\in \mathbb {R} ^{(H\times W)\times (H\times W)}\) denote the reciprocal spatial attention maps.

$$\begin{aligned} A_{s}^{ch}= & {} \sigma \left( GP(T_{q})\odot GP(I_{s})\right) , \end{aligned}$$
(10)
$$\begin{aligned} A_{q}^{ch}= & {} \sigma \left( GP(T_{s})\odot GP(I_{q})\right) , \end{aligned}$$
(11)

where \(\odot \) denotes element-wise multiplication. GP is the global pooling operation involving both average-pooling and max-pooling. The superscript ch denotes the channel dimension. \(A_{s}^{ch}\) and \(A_{q}^{ch}\in \mathbb {R} ^{C\times 1\times 1}\) represent the reciprocal channel attention maps. Finally, the global attention representations \(G_{s}\) and \(G_{q}\in \mathbb {R} ^{C\times H\times W}\) are obtained by the following calculation:

$$\begin{aligned} G_{s}= & {} A_{s}^{ch}\odot (A_{s}^{sp}\otimes D_{s})^{\top }+D_{s}, \end{aligned}$$
(12)
$$\begin{aligned} G_{q}= & {} A_{q}^{ch}\odot (A_{q}^{sp}\otimes D_{q})^{\top }+D_{q}, \end{aligned}$$
(13)

where \(G_{s}\) and \(G_{q}\) are the ultimate representations of PANet, endowed with pluralistic attentions: regional spatial attention, adjacent channel attention, global spatial attention and global channel attention. These attentions are complementary to each other and conspicuously enhance the distinctiveness of the features obtained from both the support and query samples.
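A compact sketch of the GERA computation (Eqs. 6-13) is given below. The reshaping of the spatial attention maps and the form of the position embeddings \(E_s\), \(E_q\) are our own assumptions; the sketch is meant to convey the data flow rather than reproduce the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GERA(nn.Module):
    """Sketch of Eqs. (6)-(13): reciprocal spatial and channel attention
    between a support and a query representation. Names are illustrative."""
    def __init__(self, channels, h, w):
        super().__init__()
        self.w_s = nn.Conv2d(channels, channels, 1, bias=False)
        self.w_q = nn.Conv2d(channels, channels, 1, bias=False)
        # Relative position embeddings E_s and E_q (assumed learnable, per position).
        self.e_s = nn.Parameter(torch.randn(1, channels, h, w))
        self.e_q = nn.Parameter(torch.randn(1, channels, h, w))

    @staticmethod
    def _gp(x):                          # global avg + max pooling -> (B, C)
        return (F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)).flatten(1)

    def forward(self, d_s, d_q):         # (B, C, H, W) each
        B, C, H, W = d_s.shape
        t_s, t_q = self.w_s(d_s), self.w_q(d_q)               # Eq. (6)
        i_s, i_q = t_s + self.e_s, t_q + self.e_q             # Eq. (7)

        flat = lambda x: x.reshape(B, C, H * W)               # (B, C, HW)
        # Reciprocal spatial attention, Eqs. (8)-(9): (B, HW, HW)
        a_s_sp = torch.softmax(flat(t_q).transpose(1, 2) @ flat(i_s), dim=-1)
        a_q_sp = torch.softmax(flat(t_s).transpose(1, 2) @ flat(i_q), dim=-1)
        # Reciprocal channel attention, Eqs. (10)-(11): (B, C, 1, 1)
        a_s_ch = torch.sigmoid(self._gp(t_q) * self._gp(i_s)).view(B, C, 1, 1)
        a_q_ch = torch.sigmoid(self._gp(t_s) * self._gp(i_q)).view(B, C, 1, 1)

        # Eqs. (12)-(13): spatial then channel attention with a residual.
        g_s = a_s_ch * (flat(d_s) @ a_s_sp.transpose(1, 2)).view(B, C, H, W) + d_s
        g_q = a_q_ch * (flat(d_q) @ a_q_sp.transpose(1, 2)).view(B, C, H, W) + d_q
        return g_s, g_q
```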

3.5 Dual-centralization (DC) Cosine Similarity

In this section, we design a new metric approach to compute the similarity between the support and query representations. We first aggregate the spatial information of \(G_s\) and \(G_q\) using the global pooling operation and get final embeddings s and \(q\in \mathbb {R} ^{C}\). Then we perform dual-centralization on different dimensions of these embeddings to enhance the reliability of cosine similarity. The complete calculation process is as follows:

$$\begin{aligned} s^{\prime }= & {} s-\frac{1}{N}\sum _{i=1}^N{s_i}-\frac{1}{C}\sum _{j=1}^C{\left[ s-\frac{1}{N}\sum _{i=1}^N{s_i}\right] ^{j}}, \end{aligned}$$
(14)
$$\begin{aligned} q^{\prime }= & {} q-\frac{1}{M}\sum _{i=1}^M{q_i}-\frac{1}{C}\sum _{j=1}^C{\left[ q-\frac{1}{M}\sum _{i=1}^M{q_i}\right] ^{j}}, \end{aligned}$$
(15)

where N is the total number of support embeddings, M is the total number of query embeddings, and C is the total number of channels. The first centralization is applied among all the support or query embeddings. The second centralization is applied to the channel dimension of each embedding. After two centralizing operations, we get the adjusted embeddings \(s^{\prime }\) and \(q^{\prime }\in \mathbb {R} ^{C}\). Before calculating the similarity, we average the K support embeddings for each class to obtain a set of prototype embeddings \(\overline{s^{\prime }}\):

$$\begin{aligned} \overline{s^{\prime }} = \frac{1}{K} \sum _{k=1}^{K}{s_k^{\prime }}. \end{aligned}$$
(16)

Then the cosine similarity is utilized to measure the distance between the prototype and query embeddings. The similarity score logit is defined as:

$$\begin{aligned} logit=sim<\overline{s^{\prime }}\ , \ q^{\prime }>, \end{aligned}$$
(17)

where \(sim<\cdot>\) denotes the cosine similarity. By introducing two centering operations, our DC cosine similarity alleviates the disparity of data distributions along different dimensions, which makes the similarity metric more accurate and reliable than the general cosine similarity.
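The dual-centralization cosine similarity of Eqs. 14-17 can be sketched as follows (a hypothetical implementation; the ordering of the support embeddings class by class is an assumption made to form the prototypes):

```python
import torch
import torch.nn.functional as F

def dc_cosine_logits(support, query, n_way, k_shot):
    """Sketch of Eqs. (14)-(17): dual-centralization cosine similarity.

    support: (N*K, C) pooled support embeddings, ordered class by class
    query:   (M, C) pooled query embeddings
    """
    # First centralization: subtract the mean over all support / query embeddings.
    s = support - support.mean(dim=0, keepdim=True)
    q = query - query.mean(dim=0, keepdim=True)
    # Second centralization: subtract each embedding's mean over channels.
    s = s - s.mean(dim=1, keepdim=True)
    q = q - q.mean(dim=1, keepdim=True)
    # Eq. (16): average the K support embeddings per class to form prototypes.
    prototypes = s.view(n_way, k_shot, -1).mean(dim=1)                    # (N, C)
    # Eq. (17): cosine similarity between each query and each prototype.
    return F.cosine_similarity(q.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
```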

3.6 Loss Function

In the training process, the total loss L includes the global classification loss \(L_{global}\) and the episode loss \(L_{episode}\). \(L_{global}\) learns inter-class relationships across the entire data distribution space by training a global classifier. The global classifier adopts a fully-connected layer followed by softmax to correctly classify query samples among all I categories in the training set. \(L_{global}\) is computed as:

$$\begin{aligned} L_{global} = -\log \frac{\exp \left( w_{i}^{\top }f_q + b_i \right) }{\sum _{i^{\prime }=1}^{I}{\exp \left( w_{i^{\prime }}^{\top }f_q+b_{i^{\prime }} \right) }}, \end{aligned}$$
(18)

where \(f_q\in \mathbb {R} ^{C}\) is the average-pooled base representation. \([w_{1}^{\top }, w_{2}^{\top }, \cdots , w_{I}^{\top }]\) and \([b_1, b_2, \cdots , b_I]\) are the weights and biases of the fully-connected layer, respectively. \(L_{episode}\) is computed within each episode using only a few samples per class; it measures the similarity between query and support samples to recognize the query samples. \(L_{episode}\) is calculated as:

$$\begin{aligned} L_{episode} = -\log \frac{\exp \left( logit_{n} \right) }{\sum _{n^{\prime }=1}^{N}{\exp \left( logit_{n^{\prime }} \right) }}, \end{aligned}$$
(19)

where \(logit_{n}=sim<\overline{s_{n}^{\prime }}\ , \ q_{n}^{\prime }>\) is the similarity score between the prototype of the \(n_{th}\) class and a query sample from the \(n_{th}\) class, and \(logit_{n^{\prime }}\) is defined similarly. N is the total number of classes in an N-way K-shot episode.

The overall loss function is defined as:

$$\begin{aligned} L=L_{global}+\lambda L_{episode}, \end{aligned}$$
(20)

where \(\lambda \) is an adjusting hyperparameter that balances the effects of the different loss terms. The model is trained by optimizing L with a gradient descent algorithm. The training process of our model is summarized in Algorithm 1.
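Putting Eqs. 18-20 together, the total training loss can be sketched as below. Both terms are standard softmax cross-entropy losses over the respective logits; the default value of `lam` is only an example, taken from the value reported later as optimal on miniImageNet:

```python
import torch
import torch.nn.functional as F

def total_loss(global_logits, global_target, episode_logits, episode_target, lam=0.25):
    """Sketch of Eqs. (18)-(20): global classification loss plus episode loss.

    global_logits:  (Q, I) fully-connected classifier outputs over all I base classes
    episode_logits: (Q, N) DC cosine similarity scores within the episode
    lam: the balancing hyperparameter lambda in Eq. (20)
    """
    l_global = F.cross_entropy(global_logits, global_target)      # Eq. (18)
    l_episode = F.cross_entropy(episode_logits, episode_target)   # Eq. (19)
    return l_global + lam * l_episode                              # Eq. (20)
```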

Algorithm 1 Training process of PANet

4 Experiment

In this section, we evaluate our proposed PANet on four standard benchmarks (miniImageNet, CUB-200-2011, CIFAR-FS and tieredImageNet) and compare the results with recent state-of-the-art FSL methods on both 5-way 1-shot and 5-way 5-shot tasks. To verify the effectiveness of major components, we also perform a series of ablation studies and comparison experiments with other attention modules.

4.1 Dataset

miniImageNet is a subset of ImageNet [41]. It consists of 100 categories and each category includes 600 images. Following the instruction of [42], we split the dataset into train, val and test with 64, 16 and 20 classes respectively.

tieredImageNet is another subset of ImageNet. It consists of 608 categories and each category contains 1200 images. Following the setup provided by [43], the dataset is divided into 351 classes for training, 97 classes for validation and 160 classes for testing respectively.

CUB-200-2011 is a fine-grained bird classification dataset [44]. It comprises 11788 images from 200 classes. As suggested in [45], CUB is separated into 100, 50 and 50 categories for training, validation and testing respectively.

CIFAR-FS is a subset of CIFAR-100 [46]. It contains 100 classes and each class includes 600 images. Following the data preparation from [30], we divide the dataset into 64 training categories, 16 validation categories and 20 testing categories.

4.2 Implementation Details

Following recent FSL works [47, 48], we employ ResNet12 as the backbone of our model, and the input image size is set to 84\(\times \)84. During the training stage, the proposed PANet as well as the backbone with a fully-connected layer are trained in parallel in a single stage, which differs from the commonly used two-stage framework (pretraining followed by episodic meta-training) [19, 27]. Note that the fully-connected layer only provides auxiliary updates for the backbone in the training phase and is not used during validation and testing. We test 15 query samples for each class in an episode and adopt the average classification accuracy with 95% confidence intervals over 10,000 randomly sampled test episodes as the final result. We employ stochastic gradient descent (SGD) as the optimizer with a weight decay of 0.0005 and a momentum of 0.9. The spatial and channel extents k are both set to 5. We train the model for 90 epochs with an initial learning rate of 0.1, decayed by a factor of 0.05 at the 60th, 70th and 80th epochs. All experiments are implemented with the PyTorch toolkit.
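For reference, the stated optimization schedule corresponds to roughly the following PyTorch configuration (a sketch under the stated hyperparameters; the model object is only a placeholder):

```python
import torch

# SGD with momentum 0.9 and weight decay 5e-4, initial LR 0.1,
# decayed by a factor of 0.05 at epochs 60, 70 and 80, for 90 epochs.
model = torch.nn.Linear(10, 5)  # placeholder for the actual PANet model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 70, 80], gamma=0.05)

for epoch in range(90):
    # ... iterate over sampled episodes and optimize the total loss ...
    scheduler.step()
```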

Table 1 Results on miniImageNet and tieredImageNet datasets
Table 2 Results on CIFAR-FS dataset

4.3 Comparison with State-of-the-Art Methods

To evaluate the performance of our model, we conduct a series of comparison experiments on four standard FSL benchmarks in both the 5-way 1-shot and 5-way 5-shot settings. As shown in Table 1, we conduct experiments on two classic datasets, miniImageNet and tieredImageNet. The results explicitly indicate that our model achieves new state-of-the-art performance compared to other methods, including some with complex structures or larger backbones (WRN-28-10 [50, 52], ResNet34 [56], ViT [63], etc.). For example, our method outperforms FewTRUE [63] by 1.58% (1-shot) and 1.21% (5-shot) on miniImageNet and by 1.73% (1-shot) and 1.49% (5-shot) on tieredImageNet. In addition, under the same backbone (ResNet12), compared with the recent method DeepEMDv2 [64], the performance gains of our method are 0.83% and 1.59% on miniImageNet, and 0.40% and 0.94% on tieredImageNet. We also validate the performance of our model on the CIFAR-FS dataset, as shown in Table 2. Compared with methods using different backbones, our method remains competitive on few-shot classification tasks. It can be seen that our method achieves a 0.73% improvement on the 1-shot task and nearly a 2% improvement on the 5-shot task compared to the state-of-the-art methods.

Table 3 Results on CUB-200-2011 dataset

CUB-200-2011 is a fine-grained dataset, which makes it more challenging for few-shot image classification tasks. As shown in Table 3, our method outperforms BaseTransformer [67] by 0.70% on the 1-shot task and outperforms RENet [58] by 0.93% on the 5-shot task. These results convincingly demonstrate that extracting comprehensive features from a single image and enhancing the transferability between the support and query images can achieve outstanding performance in few-shot classification tasks.

Table 4 Comparison with other attention modules

4.4 Comparison with Other Attention Modules

We conduct comparison experiments with other conventional attention modules to verify the advantage of our PANet. The experiments are performed on miniImageNet and CIFAR-FS in the 5-way 1-shot setting. We compare our PANet and its sub-attention modules (LEIA and GERA) with both single-attention methods that use only one type of attention module and multi-attention methods that involve several different attention modules. As can be seen in Table 4, the majority of multi-attention modules outperform our enhanced baseline, while the single-attention modules perform almost as well as our enhanced baseline. GLAM [73] and RENet [58] reach nearly identical accuracies on miniImageNet compared with our two sub-attention modules. However, by combining LEIA and GERA, PANet achieves much better performance than the other attention modules. We also report the parameter count of the model for each attention type in Table 4. Compared with single-attention models, PANet contains more attention mechanisms and therefore requires more parameters to achieve higher classification accuracy. Among multi-attention models, PANet exhibits a significant performance advantage with a comparable parameter count. These comparative results underscore the feasibility of our method.

4.5 Ablation Study

Table 5 Ablation study on miniImageNet and CIFAR-FS

Effects of each component in our proposed method. In this work, we mainly propose three components: the local encoded intra-attention (LEIA) module, the global encoded reciprocal attention (GERA) module and the dual-centralization (DC) cosine similarity. To investigate the effect of each component, we conduct thorough ablation experiments on the miniImageNet and CIFAR-FS datasets in the 5-way 1-shot setting. We adopt ResNet12 with the general cosine similarity as the baseline. We first evaluate the effect of DC on the model and then carry out the remaining ablation experiments on the basis of the enhanced baseline (baseline with DC). The detailed results in Table 5 demonstrate that each key component makes a pivotal contribution to the final state-of-the-art performance. Our baseline achieves 65.62\(\%\) on miniImageNet and 72.51\(\%\) on CIFAR-FS. When we replace the general cosine similarity with our DC, the accuracy increases by 1.13\(\%\) and 1.55\(\%\) respectively compared with the baseline. On this basis, adding LEIA achieves 1.16\(\%\) and 0.86\(\%\) performance gains compared to the enhanced baseline. Using GERA alone achieves a similar effect to LEIA and also improves over the enhanced baseline. Finally, we achieve higher performance gains by combining LEIA and GERA, improving by more than 2\(\%\) over the enhanced baseline on miniImageNet and by about 3\(\%\) on CIFAR-FS.

Fig. 5 Loss of validation on miniImageNet dataset

Fig. 6 Accuracy of validation on miniImageNet dataset

Fig. 7 The effect of different arrangements of LEIA on miniImageNet, tieredImageNet, CUB-200-2011 and CIFAR-FS datasets in 5-way 1-shot setting

Moreover, we also analyze the training loss and classification accuracy on miniImageNet in the 5-way 1-shot setting. Figure 5 shows the training loss of the model under different module combinations. For all combinations, the training loss tends to converge after about 60 epochs. It can also be found that PANet (DC+LEIA+GERA) has a lower loss than the other variants, which shows that the combination of the three modules is more conducive to the optimization of the network. Figure 6 depicts the accuracy curves of the baseline and our different models; we can clearly see that our models have a faster growth rate and higher accuracy. Similarly, PANet achieves the highest classification accuracy. The ablation experiments strongly suggest that our proposed PANet is an effective method for FSL.

Arrangement of attention modules. In this experiment, we explore how the arrangement of the different submodules in LEIA affects its performance. We compare five different ways of arranging the submodules of LEIA: (1) regional spatial attention only, (2) adjacent channel attention only, (3) sequential local spatial-channel attention, (4) sequential local channel-spatial attention, and (5) parallel use of both attention modules. Since each module focuses on different dimensions, the order may affect the overall performance. Figure 7 summarizes the 5-way 1-shot experimental results for the different attention arrangements. From the results, we find that capturing local attention sequentially achieves higher accuracy than doing so in parallel on all four datasets. In addition, the order of channel attention and spatial attention also has a great impact on the performance of the model: the spatial-then-channel order yields the best overall performance. The combination of the two attention modules outperforms the spatial or channel attention alone, which shows that adopting two kinds of attention is effective, and the optimal arrangement further improves the performance.

Choice of the hyperparameter.  To select an appropriate adjusting hyperparameter for balancing the loss terms in Eq. 20, we conduct a series of comparison experiments on miniImageNet, tieredImageNet, CUB-200-2011 and CIFAR-FS. Figure 8a shows the 5-way 1-shot and 5-way 5-shot accuracies on the test set of miniImageNet; we find that \(\lambda = 0.25\) is the best among the candidate values for both the 5-way 1-shot and 5-way 5-shot settings. Different datasets have different statistics, and the optimal value of \(\lambda \) may differ across datasets. According to the experimental results shown in Fig. 8b-d, we conclude that the choice of the optimal hyperparameter is closely related to the dataset and largely independent of the setting (5-way 1-shot or 5-way 5-shot). For tieredImageNet, we get the best performance on the test set when \(\lambda =0.25\). For CIFAR-FS, \(\lambda = 0.5\) is the optimal choice. The model achieves higher accuracy with \(\lambda = 1.5\) on CUB-200-2011.

Fig. 8 The effect of different values of \(\lambda \) on different datasets

4.6 Feature Visualization Analysis

In this section, we provide visualization results to further understand the impact of our method. We apply Grad-CAM [74] to PANet to visualize the important regions learned by the model. We randomly select two categories of images from the miniImageNet dataset, with two images per category, and feed them into the trained PANet. Figure 9 shows the learned weights visualized after the images pass through the different attention modules of PANet. The LEIA module includes regional spatial attention and adjacent channel attention, with the ability to focus on and capture local features, making it better able to highlight and emphasize targets in images than baseline methods. The GERA module demonstrates its capability to deactivate irrelevant features and locate the common region in the support-query image pair, indicating its proficiency in capturing global features that are crucial for understanding the broader context and semantic relationships between objects or regions. With the subsequent integration of the GERA module, the model exhibits a more accurate understanding of the overall localization of target objects. The highlighted regions in the heatmap increasingly overlap with the regions in the image containing the target objects. Both local and global features are valuable in image classification tasks, and the LEIA and GERA modules have been designed to leverage and exploit these features for accurate support-query image pair analysis.

Fig. 9 Heatmaps of each module from PANet. Here we select two categories of four images for visualization. F is the output of the feature extractor, representing the base features; D is the output of the LEIA module, representing the learned local features; G is the output of the GERA module, representing the learned global attention features; subscripts s and q denote the support and query image, respectively

Fig. 10 Comparison with other attention models. We compare four different types of attention mechanisms: SE represents single channel attention; Local represents single spatial attention; CAN represents cross attention; CBAM represents multi-attention. PANet is our proposed method, which involves pluralistic attention mechanisms

We also conduct comparative experiments with other attention models in Fig. 10. The same input image is passed through the different attention models and the results are visualized with the Grad-CAM method. The visualization results show that our proposed PANet captures broader and more accurate features than the other attention models.

Fig. 11 t-SNE visualization on miniImageNet. The top row shows the classification of the 5-way 1-shot task under different methods; the bottom row represents the 5-way 5-shot task

Figure 11 shows the t-SNE visualization of the baseline, LEIA, GERA and PANet. We randomly sample 500 images from five test classes of miniImageNet and obtain the embeddings of all images using the trained models in the 5-way 1-shot and 5-way 5-shot settings, separately. We then apply t-SNE to reduce these embeddings to 2D and use different colors to represent different categories. From Fig. 11a, we can observe that the class boundaries of the embeddings produced by the baseline are fuzzy. Moreover, samples of the same category are scattered and samples of different categories overlap. By adding LEIA, different classes obtain roughly precise boundaries in Fig. 11b. In addition, samples of the same category in Fig. 11c become more compact by applying GERA. As shown in Fig. 11d, with the joint effect of the two modules, our proposed method generates embeddings with more compact clusters and more discriminative class boundaries. In the 5-way 5-shot setting, the visualization is better because more support samples are available to the model. From Fig. 11e-h, we find that the clusters are more compact and the category distinctions are more evident compared with the corresponding 5-way 1-shot visualization results.
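The t-SNE visualization procedure can be reproduced with a few lines of scikit-learn (a sketch with placeholder embeddings and labels; the actual features come from the trained models):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `embeddings` holds the (500, C) features of the sampled test images and
# `labels` the corresponding class indices; placeholders are used here.
embeddings = np.random.randn(500, 640)          # placeholder features
labels = np.random.randint(0, 5, size=500)      # placeholder class labels

# Reduce the embeddings to 2D and color points by category.
points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of query embeddings (5 test classes)")
plt.show()
```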

5 Conclusion

In this work, we have designed a novel attention framework (PANet) for the few-shot classification task, which involves pluralistic attention mechanisms including local attention, global attention, spatial attention, channel attention, intra-attention and reciprocal attention. Our model obtains discriminative representations by extracting comprehensive features from every single image with the local encoded intra-attention (LEIA) module and capturing the most relevant regions of support-query sample pairs with the global encoded reciprocal attention (GERA) module. Besides, with the contribution of the dual-centralization (DC) cosine similarity, the distance between the support and query representations can be computed accurately. The experimental results demonstrate that PANet achieves new state-of-the-art performance on four standard FSL benchmarks and that every essential component in the framework plays a crucial role. In future research, we plan to apply our method to other tasks, such as semantic segmentation and object detection.