1 Introduction

With the development of deep learning and pre-trained language models, text classification has achieved great success [1,2,3,4,5]. Generally, training deep-learning text classification models requires a large amount of labeled data to achieve competitive performance. However, in real-world scenarios, certain classes have only a small amount of labeled data [6,7,8]. Labeling large amounts of data is labor-intensive and time-consuming. In contrast, humans can learn new concepts from only a small amount of data [9]. Inspired by this observation, researchers have focused on few-shot learning. Few-shot learning is often described in terms of N-way K-shot tasks, where N is the number of classes and K is the number of samples per class (K is generally small). The model must use this small amount of data to predict unknown samples.

Meta-learning has recently been frequently used for few-shot classification tasks [1, 9,10,11,12,13,14,15,16,17]. Meta-learning usually uses episodic training to give the model a certain generalization ability. Episodic training first randomly samples many different N-way K-shot tasks, which are used to train the model to extract meaningful features and compare sample similarities while reducing the impact of task-specific components. A key advantage of this approach is that it learns a metric function for determining the class of unknown samples, which handles noise and inter-class differences more effectively than learning the feature distribution of samples alone. Another important advantage of learning a metric function is that it does not depend on the parameterization of sample features but only on computing sample similarity or distance. Therefore, when the same metric is used in a new domain or task, the previously learned metric space can be applied directly without retraining the model. These advantages have led to the increased use of metric-based meta-learning methods in few-shot text classification. Metric-based meta-learning trains a feature encoder and a relational comparator. In each episode, the feature encoder encodes all samples into feature vectors. The K support-set vectors of each class are averaged to form class prototypes. The vectors of the samples to be predicted (the query set) are compared with the class prototypes in the relational comparator to determine their classes. Snell et al. [18] proposed the prototypical network, which uses simple Euclidean distance as the relational comparator and achieves excellent results. Pang et al. [10] discovered relevant information between the query set and the support set by learning the interactions of similar samples. Sun et al. [11] added a ball generator to the metric-based meta-learning approach; the ball generator produces additional sample vectors centered on the class prototype, and the meta-learner and the ball generator are jointly trained end to end. Xu et al. [12] noted that the prototype aggregation process ignores the differences between samples of different classes. They proposed a graph network based on multi-view aggregation that combines node adjacency information with multi-dimensional information to capture intra-class similarity and inter-class variability. Although existing methods have achieved promising results, they usually focus on the knowledge in a few labeled samples and ignore the knowledge in the much larger number of unlabeled samples.
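To make the prototype-and-compare pipeline concrete, the following is a minimal PyTorch sketch of one episode classified in the style of a prototypical network; the encoder, function names, and tensor shapes are illustrative assumptions rather than the exact implementation of any of the methods cited above.

```python
import torch
import torch.nn.functional as F

def prototypical_episode(encoder, support_x, support_y, query_x, n_way):
    """One N-way K-shot episode with a Euclidean-distance prototype classifier.

    support_x: (N*K, ...) support inputs; support_y: (N*K,) labels in [0, N).
    query_x:   (Q, ...) query inputs. `encoder` maps a batch of inputs to (batch, dim).
    """
    support_emb = encoder(support_x)                       # (N*K, dim)
    query_emb = encoder(query_x)                           # (Q, dim)

    # Class prototypes: the mean of each class's support embeddings.
    prototypes = torch.stack(
        [support_emb[support_y == c].mean(dim=0) for c in range(n_way)]
    )                                                      # (N, dim)

    # Negative squared Euclidean distance as the similarity score.
    logits = -torch.cdist(query_emb, prototypes).pow(2)    # (Q, N)
    return F.log_softmax(logits, dim=-1)                   # per-class log-probabilities
```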

Usually, self-supervised learning is an effective way to learn from unlabeled samples. However, how to effectively transfer the knowledge learned in the self-supervised phase to the training of the meta-network remains an open problem. Recently, some researchers have used knowledge distillation [19,20,21], which applies the distribution predicted by a teacher model as a supervision signal to guide the training of a student model and thereby transfers the teacher's knowledge to the student. In this paper, to better transfer self-supervised knowledge to the meta-network, we introduce a knowledge distillation method suitable for few-shot learning. Unlike existing knowledge distillation methods, our proposed model enriches the meta-learning representation with self-supervised information and fuses all information into a unified distribution to train a meta-network with stronger generalization capability. Specifically, we propose a new metric-based meta-learning method for few-shot text classification, Self-supervised Information Enhanced Meta-Learning (SEML). SEML uses a two-stage training approach: a self-supervised training stage and a meta-training stage. In the self-supervised stage, a feature encoder learns from unlabeled samples using a self-supervised approach. The trained self-supervised encoder then participates in the meta-training stage as a component. In the meta-training stage, the self-supervised encoder extracts the features of all samples in a task, and a graph defined by the current task aggregates the sample information so that the query set information and the support set information can interact. We incorporate the aggregated self-supervised information into the meta-learning representation, and the newly generated feature vectors participate directly in the training of the meta-network. In this way, the knowledge learned in the self-supervised training stage can be efficiently transferred to the meta-training stage, yielding a more generalizable meta-network.

The main contributions of this paper are summarized as follows.

  • In order to improve the generalization ability of the meta-network, we introduce a new knowledge distillation method. It can merge the knowledge learned by the model in the self-supervised stage into the meta-feature vector to expand and enrich the original representation.

  • We use a graph structure with trainable parameters to aggregate node information. In this way, the support set and query set information can effectively interact to generate a more discriminative feature representation.

  • Extensive experiments were conducted on three public few-shot text classification datasets, and multiple metrics were used to evaluate the model's performance. The experimental results in the 5-way 1-shot and 5-way 5-shot settings show that SEML outperforms existing state-of-the-art models.

This paper has the following structure: Sect. 2 discusses the related work of this paper. In Sect. 3, the problem is formalized. Section 4 describes our model in detail. In Sect. 5, we show how our model performs in various experiments. Section 6 is the conclusion of the paper.

2 Related Work

This section discusses related work from three aspects: self-supervised learning, few-shot text classification, and few-shot learning combined with knowledge distillation.

2.1 Self-Supervised Learning

Self-supervised learning tries to increase a model’s capacity for feature extraction. It exploits the features of the data itself as supervised information by designing proxy tasks [22,23,24]. Devlin et al. [25] and Liu et al. [26] trained language models to predict randomly masked tokens. Lan et al. [27] took a pair of consecutive sentences from a document, then randomly disordered the sentences, and finally classified whether the sentences were in the correct order. However, the aforementioned methods focus only on the samples themselves and ignore their differences.

Recently, contrastive learning has emerged as a significant research direction for self-supervised learning. The goal is to bring similar samples (or their augmented versions) as close as possible in the feature space while pushing dissimilar samples further apart. To address the issue of text representation collapse in BERT, Yan et al. [28] put forward ConSERT, a text representation transfer strategy based on contrastive learning. Through fine-tuning, ConSERT enables the model to generate text representations better suited to the data distribution of downstream tasks. Gao et al. [29] noticed that existing data augmentation methods were overly intricate and suggested a straightforward approach of employing dropout for data augmentation, which produced encouraging performance. Their method, SimCSE, leverages dropout to generate positive instances by passing a sample through an encoder twice to obtain a pair of related instances; negative instances are taken from the other texts in the same batch. The above methods generally construct positive instances by data augmentation. Unlike existing methods, Kim et al. [30] focused on obtaining better sentence embeddings from pre-trained Transformers. They suggested a self-guided contrastive learning strategy to increase the quality of BERT [25] text representations. Instead of relying on data augmentation, they fine-tune BERT using the representations of its different layers as positive samples. Furthermore, they redesign and apply the contrastive learning objective (NT-Xent) to text representation learning. In this paper, we use the self-supervised method proposed by Kim et al. [30] to extract the semantic information of instances.

2.2 Few-Shot Text Classification

Few-shot text classification focuses on processing text classification tasks containing only a few labeled samples. It investigates how to make the model learn enough prior knowledge and quickly generalize to new tasks [11, 12, 16, 31,32,33]. The existing few-shot text classification methods are divided into data augmentation, transfer learning, and meta-learning.

Increasing the minority class samples using data augmentation is a straightforward method [34]. Wei et al. [35] increased the number of samples by modifying the raw text. Kim et al. [36] observed that existing data augmentation methods neglect to capture the structural information of language and generated augmented instances with diverse yet grammatically plausible syntactic structures. Unlike synthesizing new samples in raw text, Sun et al. [11] proposed MEDA, which compensates for the lack of original data by computing the smallest enclosing ball of the original samples in the feature space and synthesizing new samples within the ball. Since the semantics of the synthesized training samples are similar to those of the original samples, improving the generalization ability of few-shot learning models through data augmentation remains challenging. Transfer learning typically pre-trains encoders on a large-scale corpus and then applies the trained encoders directly to few-shot tasks; in this way, knowledge can be transferred indirectly to the current task [16]. Bao et al. [31] found that applying transfer learning to textual tasks can be challenging, as lexical features that are informative for one task may not be relevant for another. To address this problem, their model learns features not only from words but also from their distribution signatures. Han et al. [16] observed that relying solely on the distribution signatures of the training data may not be adequate to adjust the model for new tasks. To overcome this limitation, they proposed MLADA, which integrates an adversarial domain adaptation network to enhance the model's adaptability and generate high-quality text embeddings for new categories. To classify new classes using just a few labeled samples, meta-learning tries to acquire general meta-knowledge by constructing episodes. Meta-learning is divided into optimization-based methods and metric-based methods. Through a few gradient update steps, optimization-based methods enable the model to adapt quickly to a new task. MAML [37] is a representative optimization-based model; it learns an appropriate parameter initialization from the base classes and transfers these parameters to the new classes within a few gradient steps. Although optimization-based meta-learning methods are effective to a degree, they may overfit to a small set of tasks, which makes it challenging to apply them to new tasks. Additionally, these methods are highly sensitive to anomalies in the training data, which can lead to unstable performance. Measuring distances is usually easier than learning feature distributions. The metric-based meta-learning approach determines the class of a query by comparing the distance between the query vector and the class prototype vectors, and thus has stronger generalization capability and greater robustness. The prototypical network [18] averages the support sets as prototypes and determines the class based on the Euclidean distance between the query and each prototype. Xu et al. [12] were concerned that the prototypical network could not fully utilize the valuable information in the support set and failed to distinguish between samples from different classes. To address these challenges, they proposed a multi-dimensional approach that considers the adjacency relationships between nodes and synthesizes features capturing intra-class similarity and inter-class variability.
Pang et al. [10] believed that processing queries and support instances separately during the text encoding phase could not effectively capture the features that matter for classification. To address this issue, they incorporated cross-class knowledge by learning the distance distributions of multiple classes and extracted intra-class knowledge by modeling the interactions between the query and support instances, which enabled more comprehensive feature characterization and improved classification performance. Although existing few-shot text classification methods have achieved promising results, they learn only from a small number of labeled samples and ignore the knowledge in the much larger number of unlabeled samples. In contrast, we introduce a novel knowledge distillation method that improves the generalization ability of the meta-network by incorporating the knowledge learned in the self-supervised stage into the training process of the meta-network.

2.3 Few-Shot Learning Combined with Knowledge Distillation

Few-shot learning addresses the case where only a few training samples are available, which makes it difficult for a model to generalize sufficiently. Knowledge distillation is mainly based on the idea of transferring knowledge from a trained model (the teacher) to an undertrained model (the student) [19,20,21]. Generally, a teacher model is pre-trained on a vast corpus to learn enough knowledge. The trained teacher model then predicts a likelihood distribution for each training sample in a task, and the student model is required to fit both the distribution of the existing labels and the distribution predicted by the teacher. Recently, some researchers have tried to use knowledge distillation to compensate for the lack of training data.
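As a reference point, the classic soft-label formulation described above can be sketched as a weighted sum of the hard-label cross-entropy and a temperature-softened KL term against the teacher's predictions. This is a generic Hinton-style distillation loss, not the distillation variant proposed later in this paper, and the temperature and weighting values are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and a temperature-softened KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the soft-target gradients on the same scale as the hard ones
    return alpha * hard + (1.0 - alpha) * soft
```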

Rashid et al. [38] combined adversarial training with knowledge distillation. They not only pre-trained a teacher model but also trained a generator, and then used the generated data together with out-of-domain data for knowledge distillation. Sauer et al. [39] focused on the tension between few-shot learning settings and the requirements of knowledge distillation: few-shot learning is applied when the number of samples is limited, while knowledge distillation typically requires a large number of annotated samples. They proposed a multi-step meta-training approach with episodic knowledge distillation, which allows the limited samples to be applied to new classes and domains. Li et al. [40] addressed the issues of high memory consumption and poor performance when samples are limited. Their solution constructs a text graph with text features as nodes and the interactions between them as edges, and uses knowledge distillation to gain additional knowledge for few-shot scenarios.

Unlike existing approaches, SEML incorporates the knowledge learned in the self-supervised stage into the feature vectors generated by the meta-network rather than using the self-supervised model to provide additional supervised signals. Compared with existing knowledge distillation methods, our method can broaden and enrich the original representation to train a more powerful meta-network.

3 Problem Definition

We use episodic training to train our model, which has been successful in numerous few-shot tasks [1, 9,10,11, 16,17,18, 31, 33, 39, 41, 42]. Episodic training maintains the consistency of the training process and testing process. The main point is to construct N-way K-shot tasks.

Given a dataset containing multiple labels, we randomly sample data from several classes as the training set \(y_{train}\) and take the data from the remaining classes as the testing set \(y_{test}\), with \(y_{train} \cap y_{test}=\emptyset \). From the training set, we randomly sample N classes; for each class, we sample K instances as the support set and use the remaining instances as the query set. Usually, N is taken to be 5, and K is a small number, typically 1 or 5. By sampling repeatedly, we obtain multiple N-way K-shot tasks. During training, the query sets are labeled, and the model updates its parameters using the labeled query sets. Similarly, we construct multiple N-way K-shot tasks from the test set, and the trained model uses the support set to predict the unlabeled query set.
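For illustration, episode construction can be sketched as below; the dictionary-based data layout, the function name, and the query-set size are assumptions made for this sketch.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=25):
    """Sample one N-way K-shot episode (support set + query set).

    data_by_class: dict mapping each class label to a list of examples.
    """
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    random.shuffle(query)
    return support, query
```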

4 Methodology

As shown in Figs. 1 and 2, our model adopts a two-stage training pipeline. First, our model uses unlabeled samples to train a feature encoder \(\text {BERT}^{turn}_{ \omega }\). In the second stage, called the meta-training stage, the feature vectors encoded by \(\text {BERT}^{turn}_{\omega }\) are first used to construct a graph. A graph aggregation module then aggregates the node information and generates a more discriminative feature representation. Next, our model distills the knowledge learned in the first stage into the feature representation of the meta-encoder \(\text {BERT}_\theta ^{meta}\) through a novel knowledge distillation method. The newly generated feature representations participate in the training of the meta-network.

Fig. 1 Self-supervised training stage

Fig. 2 Meta-training stage

4.1 The Self-Supervised Learning Stage

In this stage, our model is trained on the training data with all labels removed. More details are shown in Fig. 1. We apply self-guided contrastive learning [30] as the self-supervised learning task. Since different intermediate layers of BERT encode different information about the same semantics, self-guided contrastive learning uses the intermediate-layer representations of BERT as positive samples and the representations of other texts as negative samples, optimizing the [CLS] representation through a contrastive objective.

In detail, we clone BERT into two copies at the beginning of training, i.e., \(\text {BERT}^{fix}\) and \(\text {BERT}^{turn}_{\omega }\). \(\text {BERT}^{fix}\) is fixed to provide the training signal, while \(\text {BERT}^{turn}_{\omega }\) is fine-tuned to learn a better text representation.

Given b texts in a mini-batch, say \(x_1,x_2,\cdots , x_b,\) we feed each text \(x_i\) into \(\text {BERT}^{fix}\) and compute the token-level hidden layer representations \(H_{i,k}\in \mathbb {R}^{len(x_i)\times d}\):

$$\begin{aligned}{}[H_{i, 0} ; \ldots ; H_{i, k} ; \ldots ; H_{i, l}]=\text {BERT}^{fix}(x_{i}) \end{aligned}$$
(1)

where \(0\le k\le l\) (0: the non-contextualized layer), l represents the number of hidden layers in BERT, d stands for the dimension of the BERT hidden layer representation, and \(len(x_i)\) stands for text length. Then, a max-pooling function is applied to each layer of \(H_{i,k}\) to extract the most significant features:

$$\begin{aligned} h_{i, k}=\text {maxpooling} \left( H_{i, k}\right) \end{aligned}$$
(2)

where \(h_{i,k}\) is the feature extracted from \(H_{i,k}\) after maximum pooling.

Since different BERT layers specialize in capturing different linguistic concepts [43], we treat the hidden states of each layer as of equal importance and apply a uniform sampler p to obtain the final view:

$$\begin{aligned} h_i=p\left( \left\{ h_{i, k} \mid 0 \le k \le l\right\} \right) \end{aligned}$$
(3)
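Assuming a Hugging Face transformers implementation of BERT (a detail not fixed by the paper), Eqs. (1)-(3) amount to collecting all hidden layers, max-pooling each layer over the token dimension, and sampling one layer uniformly. The sketch below samples one layer per mini-batch rather than per text for brevity:

```python
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_fix = AutoModel.from_pretrained("bert-base-uncased")   # frozen copy BERT^fix
for p in bert_fix.parameters():
    p.requires_grad = False

def layer_views(texts):
    """Return one max-pooled hidden-layer view per text (Eqs. 1-3)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert_fix(**batch, output_hidden_states=True)
    # out.hidden_states: tuple of l+1 tensors, each (b, len, d);
    # index 0 is the non-contextualized embedding layer.
    mask = batch["attention_mask"].unsqueeze(-1).bool()
    views = []
    for H_k in out.hidden_states:
        H_k = H_k.masked_fill(~mask, float("-inf"))   # ignore padding tokens
        views.append(H_k.max(dim=1).values)           # max-pooling over tokens, Eq. (2)
    views = torch.stack(views, dim=1)                 # (b, l+1, d)
    k = random.randrange(views.size(1))               # uniform layer sampler p, Eq. (3)
    return views[:, k, :]
```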

As shown in Fig. 1, our goal is to train \(\text {BERT}^{turn}_{\omega }\) to learn the knowledge in the unlabeled samples and generate a better representation at the [CLS] position in the last layer. Given a text \(x_i\), its [CLS] vector is defined as

$$\begin{aligned} cls_i=\text {BERT}_\omega ^{ turn}\left( x_i\right) _{\text {[CLS]}} \end{aligned}$$
(4)

We use \(\text {BERT}^{fix}\) to encode the positive and negative samples. The negative samples are the layer views of any text \(x_j\) in the mini-batch other than \(x_i\). The NT-Xent loss is used as the loss function in the self-supervised stage:

$$\begin{aligned} L^{\text{ self }}(\omega )=\frac{1}{b(l+1)} \sum _{i=1}^b \sum _{k=0}^l L_{i, k}(\omega )+\eta \cdot L^{r e g}(\omega ) \end{aligned}$$
(5)

where \(L^{reg}=\left\| \text {BERT}^{fix}-\text {BERT}_{\omega }^{turn}\right\| _{2}^{2}\) is the regularization term to prevent \(\text {BERT}^{turn}_{\omega }\) from deviating too far from \(\text {BERT}^{fix}\). \(L_{i,k}(\omega )\) is a contrastive loss. It aims to make \(cls_i\) closer to the positive sample \(h_{i,k}\) and further away from the negative sample \(h_{m,n}\). The formula for \(L_{i,k}(\omega )\) is as follows:

$$\begin{aligned} L_{i, k}(\omega )=-\log \frac{\phi (cls_i, h_{i, k})}{\phi (cls_i, h_{i,k})+\sum \nolimits _{m=1, m \ne i}^b \sum \nolimits _{n=0}^l \phi (cls_i, h_{m, n})} \end{aligned}$$
(6)

where

$$\begin{aligned} \phi (x, y)=\exp ({\text {cosine}}(f(x), f(y)) / \mu ) \end{aligned}$$
(7)

and

$$\begin{aligned} {\text {cosine}}(x, y)=\frac{x \cdot y}{\Vert x\Vert \times \Vert y\Vert } \end{aligned}$$
(8)

where f represents a multilayer perceptron containing two linear layers, \(\mu \) denotes a temperature coefficient and \(\text {cosine}(x,y)\) is the cosine similarity function.
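A possible vectorized form of Eqs. (5)-(8), without the regularization term, is sketched below. It assumes that cls already holds the projected [CLS] vectors \(f(cls_i)\) from \(\text {BERT}^{turn}_{\omega }\) and that h holds the projected layer views \(f(h_{m,n})\) from \(\text {BERT}^{fix}\), i.e., the projection head is applied outside this function.

```python
import torch
import torch.nn.functional as F

def self_guided_nt_xent(cls, h, mu=0.01):
    """NT-Xent loss over layer views, Eqs. (5)-(8) without the L^reg term.

    cls: (b, d) projected [CLS] vectors from BERT^turn.
    h:   (b, l+1, d) projected hidden-layer views from BERT^fix.
    """
    b = cls.size(0)
    cls_n = F.normalize(cls, dim=-1)
    h_n = F.normalize(h, dim=-1)
    sim = torch.einsum("id,mnd->imn", cls_n, h_n) / mu     # cosine(cls_i, h_{m,n}) / mu
    phi = sim.exp()                                        # (b, b, l+1), Eq. (7)
    idx = torch.arange(b)
    pos = phi[idx, idx, :]                                 # (b, l+1): phi(cls_i, h_{i,k})
    neg_mask = ~torch.eye(b, dtype=torch.bool).unsqueeze(-1)
    neg = (phi * neg_mask).sum(dim=(1, 2)).unsqueeze(-1)   # sum over m != i and all n
    loss_ik = -torch.log(pos / (pos + neg))                # (b, l+1): Eq. (6)
    return loss_ik.mean()                                  # 1/(b(l+1)) * sum, Eq. (5)
```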

After completing self-supervised learning, we discard all components other than \(\text {BERT}^{turn}_{\omega }\) and use only \(cls_i\) as the self-supervised text representation.

We present the specific implementation of the self-supervised stage in Algorithm 1.

Algorithm 1

4.2 The Meta-training Stage

In the meta-training stage, we use the information learned in the first stage to enhance the sample features so as to train the meta-learning network more effectively. We introduce a knowledge distillation module to incorporate the knowledge learned in the self-supervised stage into the meta-learning network. First, the trained \(\text {BERT}^{turn}_{\omega }\) encodes each sample \(x_i\) in a task as a feature vector \(v_i\). A feature matrix \(V=[v_1,\cdots ,v_n]^T\) is then formed by stacking the \(v_i\) of all samples, where n is the total number of samples. Following that, we create a graph G from the cosine similarities of these samples, where the nodes represent the sample features and the edges represent the cosine similarity between the nodes:

$$\begin{aligned} {\text {Sim}}_{i, j}=\left\{ \begin{array}{ll} {\text {cosine}}\left( V_{i,:}, V_{j,:}\right) & i \ne j \\ 0 & i = j \end{array}\right. \end{aligned}$$
(9)

where \(V_{i,:}\) denotes the i-th row of V. We then normalize the similarities to obtain an adjacency matrix A and a diagonal degree matrix D, and use the symmetrically normalized adjacency matrix to aggregate information:

$$\begin{aligned} E=D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \end{aligned}$$
(10)

We introduce a graph network so that the support set information and the query set information can interact. The normalized matrix E is used to aggregate the information between nodes and generate a more discriminative feature representation. For a sample \(x_i\), we want the feature representation of \(x_i\) itself and the information of its neighboring nodes to jointly generate a new node representation, with the aggregated node information acting as dynamic weights that guide the generation process. We regard this as a novel knowledge distillation method that distills the knowledge learned in the self-supervised stage into the meta-learning representation. The feature vector produced by this knowledge distillation method is computed as follows:

$$\begin{aligned} v_i^*=\sum _{j \in [1, n]}(\gamma I+E)_{i, j}^\tau \text {BERT}_\theta ^{meta}\left( x_j\right) \end{aligned}$$
(11)

where \(v^*_i\) denotes a vector representation incorporating self-supervised knowledge and neighbor node information, I stands for the identity matrix, and \(\tau \) is a parameter that denotes the frequency of the aggregating feature. \(\gamma \) is a trainable parameter used to balance the neighbors and itself. The i-th row and j-th column of \((\gamma I+E)^{\tau }\) is represented as \((\gamma I+E)^{\tau }_{i,j}\).
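Under one plausible reading of Eqs. (9)-(11), the graph construction and aggregation can be sketched as below; the text does not fully specify how the similarities are normalized into the adjacency matrix, so the non-negativity clamp is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def aggregate_features(v_self, v_meta, gamma, tau=1):
    """Sketch of Eqs. (9)-(11).

    v_self: (n, d) features of all task samples from the trained BERT^turn.
    v_meta: (n, d) features of the same samples from BERT^meta.
    gamma:  trainable scalar balancing a node against its neighbors.
    """
    n = v_self.size(0)
    v_n = F.normalize(v_self, dim=-1)
    sim = v_n @ v_n.t()                                    # pairwise cosine similarity
    sim = sim * (1.0 - torch.eye(n))                       # Sim_{i,i} = 0, Eq. (9)
    A = sim.clamp(min=0.0)                                 # keep edge weights non-negative (assumption)
    deg = A.sum(dim=1).clamp(min=1e-12)
    d_inv_sqrt = torch.diag(deg.rsqrt())
    E = d_inv_sqrt @ A @ d_inv_sqrt                        # Eq. (10)
    W = torch.matrix_power(gamma * torch.eye(n) + E, tau)  # (gamma * I + E)^tau
    return W @ v_meta                                      # v*_i, Eq. (11)
```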

We use the prototypical network as the backbone for its simplicity and effectiveness. For a query sample \(x_q\), we measure the distance between the feature vector of \(x_q\) and the class prototypes to predict the class of \(x_q\). The prototype of the kth class is the average of the support set vectors:

$$\begin{aligned} \mathbb {C}_k=\frac{1}{\mid K \mid } \sum _{x_t \in S_i^k} v_{x_t}^* \end{aligned}$$
(12)

where \(\mathbb {C}_k\) is the prototype of the kth class and \(S_i^k\) denotes the set of samples of the kth class in the support set \(S_i\). The probability score that \(x_q\) belongs to class k when knowledge distillation is used is calculated as follows:

$$\begin{aligned} y_{k, q}^*=\frac{\exp \left( d\left( v_q^*, \mathbb {C}_k\right) \right) }{\sum _{j \in [1, N]} \exp \left( d\left( v_q^*, \mathbb {C}_j\right) \right) } \end{aligned}$$
(13)

where \(y_{k, q}^*\) is the probability score that \(x_q\) belongs to class k when knowledge distillation is used. \(d(\cdot , \cdot )\) is the distance metric function, and we use two types of metrics.

One is the Euclidean distance. Since a larger distance implies lower similarity, we use the negative of the squared Euclidean distance as the metric:

$$\begin{aligned} d\left( x_1, x_2\right) =-\left\| x_1-x_2\right\| _2^2 \end{aligned}$$
(14)

In addition, we try a parameterized cosine metric:

$$\begin{aligned} d\left( x_1, x_2\right) =\cos \left( l\left( x_1\right) , l\left( x_2\right) \right) \end{aligned}$$
(15)

where l is a linear layer for expanding the feature dimension.
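The class scores used in Eq. (13), together with the two metrics of Eqs. (14) and (15), can be sketched as follows; the output dimension of the linear layer in the parameterized cosine metric is not specified in the text, so lin is an assumed module supplied by the caller.

```python
import torch
import torch.nn.functional as F

def prototype_logits(v_query, v_support, support_y, n_way, metric="euclidean", lin=None):
    """Class scores d(v_q, C_k) used in Eq. (13), with the metrics of Eqs. (14)-(15)."""
    prototypes = torch.stack(                               # C_k: mean support vector, Eq. (12)
        [v_support[support_y == k].mean(dim=0) for k in range(n_way)]
    )
    if metric == "euclidean":
        return -torch.cdist(v_query, prototypes).pow(2)     # Eq. (14)
    q = F.normalize(lin(v_query), dim=-1)                   # Eq. (15): cosine after a linear layer
    p = F.normalize(lin(prototypes), dim=-1)
    return q @ p.t()

# The probability scores of Eq. (13) are softmax(prototype_logits(...), dim=-1).
```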

We adopt multi-task training to train the meta-network jointly. For the support set \(S_i\) and a query sample \(x_q\), the probability score of \(x_q\) belonging to class k without knowledge distillation is calculated as follows:

$$\begin{aligned} y_{k, q}^{meta}=\frac{\exp \left( d\left( \text {BERT}_\theta ^{meta}\left( x_q\right) , \varvec{c}_k\right) \right) }{\sum _{j \in [1, N]} \exp \left( d\left( \text {BERT}_\theta ^{meta}\left( x_q\right) , \varvec{c}_j\right) \right) } \end{aligned}$$
(16)

where

$$\begin{aligned} \varvec{c}_k=\frac{1}{\mid K \mid } \sum _{x_t \in S_i^k} \text {BERT}_\theta ^{meta}\left( x_t\right) \end{aligned}$$
(17)

The cross-entropy loss is used to calculate the model loss. The hyperparameter \(\lambda \) is used to adjust the proportion of losses for the two tasks (training task with knowledge distillation and training task without knowledge distillation) in the whole loss function:

$$\begin{aligned} L^{meta}(\theta )=&-\lambda \sum _{k=1}^N \sum _{q=1}^{\mid Q_i \mid } y_{k, q}\log \left( y_{k, q}^*\right) \\&-(1-\lambda ) \sum _{k=1}^N \sum _{q=1}^{\mid Q_i \mid } y_{k, q}\log \left( y_{k, q}^{meta}\right) \end{aligned}$$
(18)

where \(y_{k,q}\) represents the ground-truth label of \(x_q\). The process of the meta-training stage is shown in Algorithm 2.

Algorithm 2
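Given the two sets of class scores, one computed from the distilled features \(v^*\) (Eq. 13) and one from the plain \(\text {BERT}_\theta ^{meta}\) features (Eq. 16), the joint objective of Eq. (18) reduces to a \(\lambda \)-weighted sum of two cross-entropy terms. A minimal sketch:

```python
import torch.nn.functional as F

def meta_loss(logits_kd, logits_meta, labels, lam=0.5):
    """Eq. (18): weighted cross-entropy over the two training tasks.

    logits_kd:   class scores from the distilled features v* (Eq. 13).
    logits_meta: class scores from plain BERT^meta features (Eq. 16).
    """
    return (lam * F.cross_entropy(logits_kd, labels)
            + (1.0 - lam) * F.cross_entropy(logits_meta, labels))
```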

In the testing phase, we remove \(\text {BERT}_\omega ^{turn}\) and the graph structure, and test with the trained \(\text {BERT}_\theta ^{meta}\) together with the prototypical network.

5 Experiments

In the experiment section, we explore the following three research questions.

RQ1: How does SEML perform compared to existing models?

RQ2: Are the innovations we propose effective?

RQ3: How do different values of the hyperparameter \(\lambda \) affect the performance of the model?

5.1 Datasets

Three public text classification datasets are used to assess our model. These datasets have been widely applied to few-shot text classification tasks.

Table 1 Statistics of the three benchmark datasets

HuffPost: HuffPost consists of news headlines published on HuffPost from 2012 to 2018. It contains 36,900 headlines divided into 41 classes.

20News: 20News contains about 20,000 news items, which are evenly divided into 20 news groups of different topics.

Reuters: Reuters is a collection of news articles from the Reuters news agency. We used the standard ApteMod version of the dataset and removed the multi-labeled documents.

Among them, 20News and Reuters are document-level datasets and HuffPost is a sentence-level dataset. The statistics of these datasets are summarized in Table 1.

5.2 Implementation Details

In our experiments, we tested two few-shot scenarios, 5-way 1-shot and 5-way 5-shot, on three datasets. In the training process, 100 episodes were sampled per epoch; in the testing process, we sampled 1000 episodes. For the meta-training stage, we used BERT-base as the encoder and pre-trained it on the datasets. The model parameters are updated with an Adam optimizer, the encoder's initial learning rate is set to 5e-6, and the batch size is set to 128. We use two different metric functions: Euclidean distance and the cosine similarity function. For the self-supervised part, we train a separate self-supervised model for each dataset; the hyperparameter \(\eta \) is set to 0.1, \(\mu \) is set to 0.01, the batch size is set to 16, and the learning rate is 5e-5. We adopted three metrics to comprehensively evaluate model performance: Accuracy, F1-macro, and F1-weighted. Accuracy is the proportion of correctly classified queries among all queries, F1-macro is the arithmetic mean of the per-class F1 scores, and F1-weighted is the weighted mean of the per-class F1 scores. Specifically, we measured them using the metrics module provided by scikit-learn. Furthermore, our model is implemented on a Linux system with PyTorch 1.8.1 and CUDA 11.1. The hardware configuration is an AMD EPYC 7543 32-Core Processor with 15 CPUs and an NVIDIA A40 GPU with 48 GB of memory.
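The three metrics can be computed with scikit-learn's metrics module; the following is a brief usage sketch in which the label arrays are placeholders.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy, F1-macro, and F1-weighted as reported in the experiments."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
    }
```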

5.3 Comparison Experiment (RQ1)

5.3.1 Baseline Models

We compare the proposed method to various widely employed few-shot learning approaches that have demonstrated promising results in natural language processing. In addition, we compare our model with recently proposed state-of-the-art models.

MAML [37]: MAML aims to help the model discover a suitable starting parameter. The model can use this parameter as a starting point to easily fit new tasks.

Prototypical Network [18]: The prototypical network is a metric-based approach that learns per-class prototypes by averaging the instances of each class in the feature space.

Induction Network [41]: The induction network integrates the relation network and dynamic routing algorithm to facilitate the meta-learning process of learning the class vectors.

HATT [42]: HATT improves the robustness of the model to text diversity and noise by introducing instance-level and feature-level attention in the prototypical network.

Distributional Signature [31]: Distributional Signature is an advanced model proposed by Bao et al. [31]. It leverages the statistical patterns of tokens to selectively attend to crucial information in the input text and builds a meta-learner with improved generalization performance.

MLADA [16]: MLADA is a state-of-the-art model proposed by Han et al. [16]. By integrating an adversarial domain adaptation network, it seeks to enhance the model's capacity for adaptation.

Table 2 Performance comparison of different methods on three few-shot text classification datasets with 5-way 1-shot and 5-way 5-shot cases

5.3.2 Results

Fig. 3 Results of SEML ablation experiments on two datasets, where KD represents knowledge distillation and GA represents graph aggregation

Fig. 4 Effect of different \(\lambda \) on model accuracy on two different datasets

As shown in Table 2, SEML obtains the highest accuracy in both the 5-way 1-shot and 5-way 5-shot settings on all three datasets. The Prototypical Network outperforms MAML on the document-level datasets (Reuters and 20News) but is worse than MAML on the sentence-level dataset (HuffPost). This may be because document-level data have richer features, which are more favorable to the inter-vector distance metric used by the Prototypical Network, whereas MAML gains a certain generalization capability by learning initialization parameters, which is more effective than distance metrics when features are sparse. Compared with the Prototypical Network, the Induction Network works better on the sentence-level dataset. This may be because the dynamic routing algorithm and the multilayer perceptron classifier are better suited to metrics under feature sparsity and also allow the model to learn stronger generalization ability. On the 20News dataset, the Induction Network is significantly less effective than the Prototypical Network and HATT, which further indicates that the distance metric is more effective on document-level data. The average accuracy of HATT in the 5-way 5-shot case is higher than that of the Prototypical Network, which indicates that instance-level and feature-level attention can effectively mitigate noise interference. Distributional Signature outperforms the traditional models in both the 5-way 1-shot and 5-way 5-shot cases on all three datasets, indicating that adequate use of word importance and distributional signatures can effectively improve the generalization ability of the model. MLADA combines a domain adversarial network with meta-learning to generate a more comprehensive set of transferable features, and thus its overall performance is better than that of Distributional Signature. Although existing methods achieve good performance, they only consider the knowledge in labeled samples and ignore the knowledge in unlabeled samples. Unlike existing methods, SEML incorporates the knowledge learned in the self-supervised stage into the training process of the meta-network through knowledge distillation, which yields a more robust meta-network. Second, SEML achieves higher average accuracy when cosine similarity is applied as the metric function. This may be because the semantic similarity between texts is reflected more in the direction of the vectors than in their magnitudes.

5.4 Ablation Study (RQ2)

In this paper, we present two main innovations: one is a novel knowledge distillation method for few-shot text classification that incorporates the knowledge learned by the model during the self-supervised stage into the training process of the meta-network; the other is a graph aggregation method for few-shot scenarios that aggregates features across samples to obtain a more discriminative feature representation. We tested the effectiveness of these two innovations on a sentence-level text classification dataset (HuffPost) and a document-level text classification dataset (20News) under the 5-way 1-shot and 5-way 5-shot settings. Figure 3 shows the results. SEML-KD denotes the variant with knowledge distillation removed, in which the model is trained using the node features output by graph aggregation. SEML-KD-GA denotes the variant with both knowledge distillation and graph aggregation removed.

According to Fig. 3, the accuracy of SEML decreases substantially after removing the knowledge distillation process, regardless of the dataset and setting. This is because knowledge distillation broadens and enriches the meta-learning feature representation, and the newly generated representation participates in the training of the meta-network; the meta-network can thus learn more knowledge, which effectively improves its generalization ability. Second, the model's accuracy decreases further after removing the graph aggregation module. The graph aggregation module aggregates relevant information through the interaction between the support set and the query set, generating a more discriminative self-supervised feature representation that is more conducive to model learning.

5.5 Hyper-parameter Adjustment (RQ3)

\(\lambda \) is a hyperparameter used to adjust the ratio of knowledge distillation loss to overall loss. A larger \(\lambda \) indicates that the feature vectors generated by knowledge distillation have a greater impact on the meta-network. Figure 4 shows the effect of different \(\lambda \) on model accuracy for the 5-way 1-shot and 5-way 5-shot cases tested on HuffPost and 20News.

According to Fig. 4, for different datasets and test scenarios, the model's accuracy increases to its maximum and then decreases as \(\lambda \) grows. When \(\lambda \) is too small, the feature vectors generated by knowledge distillation contribute little to training the meta-network, which means the meta-network learns less self-supervised knowledge. When \(\lambda \) is too large, the model relies excessively on knowledge distillation to train the meta-network; this is inconsistent with the testing phase, where only \(\text {BERT}_\theta ^{meta}\) encodes the sample features, so \(\text {BERT}_\theta ^{meta}\) is not effectively trained. When \(\lambda \) is moderate, the model balances effective training and fused knowledge and achieves optimal results.

6 Conclusion

In this paper, we address the limitation that existing metric-based meta-learning methods for few-shot text classification learn only from a small number of labeled samples and ignore unlabeled samples. We design a two-stage training paradigm consisting of a self-supervised training stage and a meta-training stage. In the self-supervised stage, the model learns from unlabeled samples by self-guided contrastive learning. In the meta-training stage, we introduce a novel knowledge distillation method that effectively incorporates the knowledge learned in the self-supervised stage into the meta-learning representation. In particular, we design a graph aggregation structure before merging the self-supervised information, which allows support set and query set information to interact efficiently in the graph network and generate more discriminative self-supervised representations. Multi-task training is applied to our model: we use the original meta-learning representation and the enhanced meta-learning representation to train the meta-network jointly. We test our model on three public few-shot text classification datasets; extensive experiments show that it outperforms existing state-of-the-art models.

Although our model achieves better performance, it still has some limitations. The current approach focuses only on the knowledge within the samples and ignores external knowledge. For example, some large-scale knowledge graphs contain high-quality structured knowledge. In the future, we expect to use such structured knowledge to assist in generating metric parameters, so that knowledge can be transferred between tasks; that is, similar tasks would use similar metrics while different tasks use different ones.