1 Introduction

Continual Learning (CL) aims to address the stability-plasticity dilemma in the context of sequential learning. Stability refers to the ability to maintain model performance on previously learned tasks, in other words, to prevent catastrophic forgetting (McCloskey and Cohen 1989). Notably, inter-task confusion (Huang et al. 2023), a phenomenon where the model confuses classes from different tasks, is one of the major causes of catastrophic forgetting. Plasticity refers to the ability to adapt to a new task. However, the majority of existing CL methods presume that labeled data are abundant enough to learn every task. Their performance heavily relies on a large quantity of high-quality labeled data, which is an impractical assumption.

In real-world scenarios, labeled data are scarce while unlabeled data are abundant. The cost of annotation, particularly in the field of Natural Language Processing (NLP), tends to be prohibitively high. To narrow this gap, we address the problem in a setting that more closely aligns with real-world scenarios, namely few-shot Continual Active Learning (CAL) (Ayub and Fendley 2022). In this setting, only a small subset of labeled data is provided for each task, together with a limited annotation budget. The model must therefore sequentially select the most worthwhile examples from a pool of unlabeled data and request their labels to enhance performance, while simultaneously solving the continual learning problem. Introducing active learning into this scheme is challenging because active learning techniques are typically designed to query from a static data distribution; they may not dynamically capture the samples most relevant for preventing catastrophic forgetting.

Replay-based methods have been shown to be particularly effective for NLP tasks (Wang et al. 2022). These methods retain a certain amount of past samples to prevent catastrophic forgetting. Consequently, they are prone to memory over-fitting, especially when labeled examples are limited. Active learning can further escalate this problem by biasing sample selection towards the memory set. Hence, integrating active learning techniques into a CL model can be quite challenging.

Given the success of meta-learning in addressing low plasticity, certain studies (Riemer et al. 2019; Gupta et al. 2020) extend meta-learning to the CL setting. In this work, we exploit the advantages of meta-learning and use the Model-Agnostic Meta-Learning (MAML) (Finn et al. 2017) framework to solve few-shot CAL problems, an approach we call Meta-Continual Active Learning. By applying active learning for task-specific tuning and casting the meta-objective as experience replay, the learning objective is deliberately formulated to learn an optimal or suboptimal initial model state that can rapidly adapt to a balanced subset of all encountered tasks. Thereby, this method enables fast adaptation while preventing catastrophic forgetting in few-shot learning. In addition, we apply consistency regularization via textual augmentations to address the memory overfitting problems that are inherent in replay-based methods and exacerbated by active learning acquisition.

We conduct extensive experiments on benchmark datasets from Zhang et al. (2015), popularized by de Masson d’Autume et al. (2019) for lifelong language learning. This collection includes five text classification datasets from four diverse domains. We demonstrate the effectiveness of the proposed framework in a 5-shot CAL setup. This paper also examines how various active learning approaches impact the performance of meta-continual learning models.

The main contributions of this paper are fourfold:

  • Leveraging the strengths of meta-learning, we introduce an optimization-based method, namely Meta-Continual Active Learning (Meta-CAL). This method reformulates the meta-objective such that it learns an optimal or a suboptimal initial model state that can effectively adapt to all seen tasks. Thereby, it provides a solution to inter-task confusion and catastrophic forgetting even with limited availability of labeled samples.

  • We integrate active learning into the proposed framework to enhance task-specific tuning. This allows the model to dynamically and selectively query the most informative samples from a pool of unlabeled data, thereby improving performance in a resource-constrained scenario.

  • To address inevitable memory overfitting problems caused by experience replay and active learning, we apply consistency regularization to meta examples via data augmentations. This further ensures intra- and inter-task generalization.

  • In the experiments, the proposed method achieves an accuracy of more than 62% while utilizing only 1.6% of the past samples and maintaining annotation budgets as low as 500 samples for each task. The results demonstrate the feasibility and effectiveness of meta-continual active learning. Furthermore, we observe that random sampling facilitates generalization in meta continual learning, thereby addressing the stability-plasticity dilemma.

2 Related work

2.1 Continual learning

Existing approaches can be categorised into three mainstreams, i.e., regularization-based methods (Kirkpatrick et al. 2017; Li and Hoiem 2018; Lin et al. 2024), replay-based methods (de Masson d’Autume et al. 2019; Ho et al. 2023) and architecture-based methods (Adel et al. 2020; Yoon et al. 2018; Wang et al. 2023).

Regularization-based methods add a penalty or regularization term to the loss function. These methods typically penalise changes in non-trivial parameters or constrain variations in the gradients learned from previous tasks. However, most methods for NLP favour replay-based approaches (Wang et al. 2022) to avoid unexpected outputs caused by tuning the parameters of deep neural networks (Wang et al. 2019).

Replay-based methods, also known as rehearsal-based or memory-based methods, involve revisiting a small amount of past samples (i.e., experience replay) or generating pseudo past samples while adapting to a new domain. Popular retrieval schemes for experience replay are random sampling (Chaudhry et al. 2019; Riemer et al. 2019; de Masson d’Autume et al. 2019; Holla et al. 2020), K-Means (Wang et al. 2019; Han et al. 2020) and Mean-of-Feature (Qin and Joty 2022; Chen et al. 2023).

Architecture-based methods dynamically change the model architecture to learn a new task. In general, these methods preserve or partially preserve past fine-tuned parameters and introduce task-specific parameters for the new domain. However, managing the scale of the model is challenging, since the number of parameters keeps accumulating as more tasks are seen.

2.2 Meta-continual learning

In recent years, meta-learning has emerged as an effective learning framework for CL. In particular, the bi-level optimization of meta-learning enables fast adaptation to training data while ensuring generalization across all observed samples. In meta-continual learning, meta-learning is often combined with memory replay. Meta-MbPA (Wang et al. 2020) uses MAML to augment episodic memory replay via local adaptation. OML-ER (Holla et al. 2020) and ANML-ER (Holla et al. 2020) utilise an Online-aware Meta-Learning model (OML) (Javed and White 2019) and a neuromodulated meta-learning model (ANML) (Beaulieu et al. 2020), respectively, for effective knowledge transfer through fast adaptation and sparse experience replay. PMR (Ho et al. 2023) also employs MAML to facilitate episodic memory replay but uses a prototypical memory sample selection approach. MER (Riemer et al. 2019) regularizes the objective of experience replay by gradient alignment between old and new tasks via a modified Reptile (Nichol et al. 2018). C-MAML (Gupta et al. 2020) utilizes OML to regulate CL objectives, and La-MAML (Gupta et al. 2020) optimizes the OML objective through the modulation of per-parameter learning rates. Meta-CL (Wu et al. 2024) further improves C-MAML and La-MAML by introducing a penalty to restrain unnecessary model updates and preserve non-trivial weights for knowledge consolidation. SB-MCL (Lee et al. 2024) enhances the advantages of meta-learning by integrating sequential Bayesian updates to bridge statistical models with meta-learned neural networks. To date, few models use gradient alignment for lifelong language learning, and most of these models operate in a fully supervised setting. In this work, we focus on aligning meta-learning with CL objectives to provide a solution in a realistic and resource-constrained scenario for NLP.

2.3 Continual active learning

Continual active learning aims to sequentially label informative data to maximise model performance while solving continual learning problems. It defines a continual learning problem in which the available labeled data are insufficient and the annotation budgets are limited. CASA (Perkonigg et al. 2021) detects new pseudo-domains and selects data from them for annotation, while revisiting labeled samples to address catastrophic forgetting. Ayub and Fendley (2022) propose a method to address few-shot CAL (FoCAL). They use a Gaussian mixture model (GMM) for active learning and pseudo-rehearsal for CL, bypassing the need to store real past data. However, neither of these methods addresses continual active learning in NLP. CAL-SD (Das et al. 2023) tackles NLP tasks and uses model distillation to augment memory replay with diversity- and uncertainty-based AL strategies. To date, continual active learning remains understudied, especially in NLP.

3 Preliminaries

In this work, we focus on the task-free class-incremental learning scenario, where the training data stream is seen only once and no “task boundaries” are given (van de Ven et al. 2021). Based on the task-free setting, we formulate the problem of few-shot continual active learning as follows. Assume that the training stream consists of T tasks, \(\{\mathcal {T}_1,\mathcal {T}_2,...,\mathcal {T}_t,...,\mathcal {T}_T\}\). Each task \(\mathcal {T}_{t}\) contains an \(N_t\)-way K-shot labeled set \(\mathcal {D}^{label}_t = \{(x_{i},y_{i})\}_{i=1}^{N_t \times K}\) and a pool of unlabeled data \(\mathcal {D}^{pool}_t = \{u_{i}\}_{i=1}^{|\mathcal {D}^{pool}_t|}\).

Label space: Based on the label space \(\mathcal {Y}\) of tasks, typical continual learning scenarios include domain-incremental learning, where \(\mathcal {Y}_t = \mathcal {Y}_{t'}, \forall t \ne t'\), and class-incremental learning, where \(\mathcal {Y}_t \cap \mathcal {Y}_{t'}= \emptyset , \forall t \ne t'\). Due to the unforeseen nature of sequential learning, the label space \(\mathcal {Y}_t\) of a task may or may not be disjoint from those of the other tasks. Hence, we allow either \(\mathcal {Y}_t \cap \mathcal {Y}_{t'} \ne \emptyset\) or \(\mathcal {Y}_t \cap \mathcal {Y}_{t'}= \emptyset\), \(\forall t \ne t'\), to occur.

Annotation constraint: We consider each task \(\mathcal {T}_{t}\) to have an equal annotation budget \(B_{A}\). An acquisition function \(a(\cdot )\) dynamically queries informative data points for annotation until the acquisition process reaches the annotation budget. We denote the samples selected for annotation as \(a(\mathcal {D}^{pool}_t, B_{A})\) and the newly annotated sample set as \(\mathcal {D}^{new}_t\).

Memory constraint: A replay-based model revisits a small subset of labeled data \(\mathcal {D}_{1:t-1}\) to regularize the model f while learning \(\mathcal {T}_t\). We limit the amount of past training data saved in the memory buffer \(\mathcal {M}\), which must not exceed the memory budget \(B_{\mathcal {M}}\).

Objectives: Given a learner \(f_{\theta }\) and the current task \(\mathcal {T}_t\), the learning objectives are: (a) perform adaptation on \(\mathcal {D}_t = \mathcal {D}^{label}_t \bigcup \mathcal {D}^{new}_t\), i.e., plasticity:

$$\begin{aligned} \tilde{\theta }_{t} = \arg \min _{\theta _{t} \in \Theta } \mathbb {E}_{ (x,y) \sim \mathcal {D}_{t}} [\mathcal {L}\big (f_{\theta _t}(x),y \big )] \end{aligned}$$
(1)

where \(\mathcal {L}\) is the task loss, \(\theta _{t}\) is the initial state and \(\theta _{t} = \tilde{\theta }_{t-1}\); (b) prevent inter-task confusion and catastrophic forgetting of prior tasks, i.e., stability:

$$\begin{aligned} \min _{\tilde{\theta }_t} \frac{1}{|t-1|} \sum _{i=1}^{t-1} \mathbb {E}_{ (x,y) \sim \mathcal {D}_{i}} [\mathcal {L}\big (f_{\tilde{\theta }_t}(x),y \big )] \end{aligned}$$
(2)

3.1 Active learning strategies

In this work, we consider four popular active learning (AL) methods as follows.

Uncertainty: This method samples data \(\varvec{x}\) with high uncertainty, measured from the model outputs \(\hat{y}\). The Least-confidence method (Culotta and McCallum 2005) evaluates uncertainty by the confidence in the prediction, where a lower posterior probability indicates greater uncertainty, \(\alpha _{LC}(\varvec{x}, n_a)= -Pr(\hat{y}|\varvec{x})\). The Margin-confidence method (Netzer et al. 2011) considers the confidence margin between the two most likely predictions \((\hat{y}_1, \hat{y}_2)\), \(\alpha _{Marg.}(\varvec{x}, n_a)= -|Pr(\hat{y}_1|\varvec{x})-Pr(\hat{y}_2|\varvec{x})|\); a small margin indicates high uncertainty. The Entropy-based method (Shannon 2001) uses the predictive entropy H as the indicator, where higher entropy reflects more uncertainty in the posterior probability, \(\alpha _{Entr.}(\varvec{x}, n_a)= H(Pr(\hat{y}|\varvec{x}))\).
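For concreteness, the three uncertainty scores can be computed from a batch of predicted class probabilities as in the sketch below; the function name and PyTorch-based signature are ours, and higher scores mean higher acquisition priority.

```python
import torch

def uncertainty_scores(probs: torch.Tensor, kind: str = "least_confidence") -> torch.Tensor:
    """Per-example uncertainty from a [batch, num_classes] probability tensor.

    Higher score = more uncertain, so a top-k over the scores selects the query set.
    """
    if kind == "least_confidence":
        # alpha_LC = -Pr(y_hat | x): the smaller the top probability, the higher the score
        return -probs.max(dim=-1).values
    if kind == "margin":
        # alpha_Marg = -|Pr(y_1|x) - Pr(y_2|x)|: a small margin between the two best classes scores high
        top2 = probs.topk(2, dim=-1).values
        return -(top2[:, 0] - top2[:, 1]).abs()
    if kind == "entropy":
        # alpha_Entr = H(Pr(y_hat | x)): predictive entropy of the full distribution
        return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    raise ValueError(f"unknown acquisition strategy: {kind}")
```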

Representative: This method selects data \(\varvec{x}\) that are geometrically representative in the vector space (Schröder et al. 2022). In this work, input data \(\varvec{x}\) with the shortest Euclidean distance to the centroid of a cluster are considered representative. KMeans applies unsupervised clustering to partition the data and uses the centroid of each cluster; the number of clusters equals the selection size \(n_a\). We also introduce a Mean-vectors method as a baseline for comparison, which averages the representation vectors of each training batch to obtain the centroid.

Diversity: This method chooses data \(\varvec{x}\) that are geometrically seen as outliers in vector space (Mosqueira-Rey et al. 2022). In this work, input data \(\varvec{x}\) with the longest Euclidean distance from a centroid are considered diverse. The centroid selections align with those used in the Representative method.

Random: This method randomly samples data \(\varvec{x}\) from unlabeled pool \(\mathcal {D}^{pool}\).
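All three geometric strategies reduce to ranking encoder features by their Euclidean distance to the nearest centroid (or sampling uniformly). A minimal sketch, assuming `features` come from the encoder and `centroids` from KMeans or batch means (names are ours):

```python
import torch

def select_by_distance(features: torch.Tensor, centroids: torch.Tensor,
                       n_a: int, mode: str = "representative") -> torch.Tensor:
    """Pick n_a indices by Euclidean distance to the nearest centroid.

    features: [N, d] encoder representations of the unlabeled batch.
    centroids: [k, d] cluster centers (e.g. KMeans centroids or batch mean vectors).
    """
    dists = torch.cdist(features, centroids).min(dim=-1).values   # distance to the nearest centroid
    if mode == "representative":
        return dists.topk(n_a, largest=False).indices             # closest points
    if mode == "diversity":
        return dists.topk(n_a, largest=True).indices              # farthest points (outliers)
    if mode == "random":
        return torch.randperm(features.size(0))[:n_a]             # uniform random baseline
    raise ValueError(f"unknown mode: {mode}")
```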

4 Learning to learn for CAL

Model-Agnostic Meta-Learning (MAML) (Finn et al. 2017) is an optimization-based meta-learning approach, often referred to as the “learning-to-learn” algorithm. Learning to learn allows a model to adapt to different data distributions, which can be seen as a form of transfer learning that improves generalization (Andrychowicz et al. 2016). Therefore, we exploit the MAML framework to facilitate knowledge transfer and generalization across tasks. Moreover, we harness its fast adaptation ability to address the challenge of resource scarcity.

Learning to fast adapt: We approximately align the meta-objective with the objectives shown in Eqs. 1 and 2 as follows,

$$\begin{aligned} \min _{\theta _{t}} \quad&\frac{1}{|t|} \sum _{i=1}^{t} \mathbb {E}_{ (x,y) \sim \mathcal {D}_{i}} [\mathcal {L}\big (f_{\tilde{\theta }_t}(x),y \big )] \nonumber \\ \text {s.t.} \quad&\tilde{\theta }_t = U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t}) \end{aligned}$$
(3)

where \(U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})\) is the update operation on \(\theta _{t}\) using the training set \(\mathcal {D}_t\). Specifically, \(U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})\) describes a gradient-descent step on \(\theta _{t}\) as

$$\begin{aligned} U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t}) = \theta _t - \alpha \nabla _{\theta _t} \mathbb {E}_{(x,y) \sim \mathcal {D}_t}\mathcal {L} (f_{\theta _t}(x),y) \end{aligned}$$
(4)

Hereby, instead of finding the tuned \(\tilde{\theta }_t\) directly, the model f learns an optimal initialization \(\theta _{t}\) that can effectively adapt to \(\mathcal {D}_{1:t}\) with few labeled examples.

Learning to continually learn: In general, the full datasets \(\mathcal {D}_{1:t-1}\) are not available while learning \(\mathcal {D}_{t}\). Thus, we retrieve a small subset of past samples \(\mathcal {M}\) for experience replay in meta-objective,

$$\begin{aligned} \min _{\theta _{t}} \quad&\sum _{(x,y) \in \mathcal {M}} [\mathcal {L}\big (f_{\tilde{\theta }_t}(x),y \big )] \nonumber \\ \text {s.t.} \quad&\tilde{\theta }_t = U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t}) \nonumber \\&= \theta _t - \alpha \nabla _{\theta _t} \mathbb {E}_{(x,y) \sim \mathcal {D}_t}\mathcal {L} (f_{\theta _t}(x),y) \end{aligned}$$
(5)

In such a way, we leverage the selected examples from the past to constrain the learning behaviour of f. As a result, it effectively tackles the problems of inter-task confusion and catastrophic forgetting.

Learning to generalize: We also exploit data augmentations to enhance the model’s ability to generalize. In particular, we apply consistency regularization (Bachman et al. 2014), which builds on the assumption that perturbations of the same input should not change the output. Inspired by FixMatch (Sohn et al. 2020), we employ two types of data augmentation in meta-training, strong and weak, denoted by \(\mathcal {A}(\cdot )\) and \(\alpha (\cdot )\), respectively. In contrast to FixMatch, we employ textual augmentations under full supervision to ensure generalization with limited data availability. Specifically, weak and/or strong augmentations are applied in the inner loop to enhance intra-task generalization, while strong augmentations are used in the outer loop to improve both intra- and inter-task generalization. More details are given in §5.1.

5 Model

Model architecture: Following Online-aware Meta-Learning (OML) (Javed and White 2019), the proposed model \(f_{\theta }\) consists of a representation learning network \(h_{\theta _{\textrm{e}}}\) with a learnable parameter set \(\theta _{\textrm{e}}\) and a prediction network \(g_{\theta _{\textrm{clf}}}\) with a learnable parameter set \(\theta _{\textrm{clf}}\). The model f is described as \(f_{\theta }(x) = g_{\theta _{\textrm{clf}}} (h_{\theta _{\textrm{e}}}(x))\). The representation learning network acts as an encoder. The prediction network is a single linear layer followed by a softmax.
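A minimal sketch of this architecture using Hugging Face's BertModel; the class and attribute names are ours, and the softmax is folded into the cross-entropy loss during training:

```python
import torch.nn as nn
from transformers import BertModel

class MetaCALModel(nn.Module):
    """f_theta(x) = g_clf(h_e(x)): a BERT encoder followed by a single linear layer."""

    def __init__(self, num_classes: int, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)                     # h_{theta_e}
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)  # g_{theta_clf}

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS] representation
        return self.classifier(cls)             # logits; the softmax is folded into the loss
```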

5.1 Training

Each episode contains m batches of examples drawn on the fly from the data stream. For each task, our model is trained on \(\mathcal {D}^{label}\) first and then on \(\mathcal {D}^{pool}\). MAML consists of two optimization loops:

5.1.1 Inner-loop optimization

The inner-loop algorithm performs task-specific tuning. We introduce data augmentations as a regularization term to improve intra-task generalization. The inner-loop loss for training samples \(\mathcal {D}_{i}^{label}\) at time step i is

$$\begin{aligned} \small \mathcal {L}_{\textbf{inner}}^{\mathcal {D}_{i}^{label}}(\theta ) = \sum _{ (x,y) \in \mathcal {D}_{i}^{label}} [ w \mathcal {L}_{CE}(f_{\theta }(x),y ) + (1-w) \mathcal {L}_{CE}\big (f_{\theta }(\alpha (x)),y \big )], \end{aligned}$$
(6)

where w denotes the relative weight and \(\mathcal {L}_{CE}\) is the cross-entropy loss.

Annotation process

When the received training batches are unlabeled, we apply the acquisition function \(a(\mathcal {D}_{i}^{pool}, m \cdot n_{a})\) to select informative data points for annotation. The selection size per batch is \(n_{a} = \lceil \frac{b\cdot B_{A}}{|\mathcal {D}^{pool}|} \rceil\), where b is the batch size and \(B_A\) is the annotation budget.
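As a quick worked example of the per-batch selection size (the helper name is ours):

```python
import math

def per_batch_selection_size(batch_size: int, annotation_budget: int, pool_size: int) -> int:
    """n_a = ceil(b * B_A / |D_pool|): number of examples to query from each unlabeled batch."""
    return math.ceil(batch_size * annotation_budget / pool_size)

# With the default setting in §6.2: b = 16, B_A = 2000, |D_pool| = 10,000  ->  n_a = 4
assert per_batch_selection_size(16, 2000, 10_000) == 4
```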

Then, the inner-loop loss for newly annotated training batches \(\mathcal {D}^{new}_i\) is

$$\begin{aligned} \small \mathcal {L}_{\textbf{inner}}^{\mathcal {D}^{new}_i}(\theta ) = \sum _{ (x,y) \in \mathcal {D}^{new}_i} [w \mathcal {L}_{CE}\big (f_{\theta }(x),y\big ) + (1-w) \mathcal {L}_{CE}\big (f_{\theta }(\mathcal {A}(x)),f_{\theta }(\alpha (x))\big )] \end{aligned}$$
(7)

We use different inner-loop losses for already-labeled data \(\mathcal {D}_{i}^{label}\) and newly labeled data \(\mathcal {D}^{new}_i\), because newly labeled data may contain more accurate and up-to-date label information. Our model then performs SGD on the parameter set \(\theta _{\textrm{clf}}\) with learning rate \(\alpha\) as

$$\begin{aligned} \tilde{\theta }_{\textrm{clf}}=\theta _{\textrm{clf}} - \alpha \nabla _{\theta } \mathcal {L}_{\textbf{inner}}(\theta ) \end{aligned}$$
(8)
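A hedged sketch of this inner-loop step is given below, reusing the MetaCALModel sketch from §5; `weak_aug` and `strong_aug` are assumed callables returning an augmented copy of the batch, and stopping gradients through the weak view (FixMatch-style) is a design choice not fixed by Eq. 7:

```python
import torch
import torch.nn.functional as F

def inner_loop_step(model, x, y, weak_aug, strong_aug, w, lr, newly_labeled):
    """One task-specific SGD step on theta_clf only (Eqs. 6-8); a hedged sketch."""
    sup_loss = F.cross_entropy(model(x), y)
    if newly_labeled:
        # Eq. 7: consistency between the strongly and weakly augmented views
        weak_probs = model(weak_aug(x)).softmax(dim=-1).detach()     # soft target from the weak view
        strong_log_probs = model(strong_aug(x)).log_softmax(dim=-1)
        reg = -(weak_probs * strong_log_probs).sum(dim=-1).mean()    # cross-entropy with soft targets
    else:
        # Eq. 6: supervised loss on the weakly augmented view
        reg = F.cross_entropy(model(weak_aug(x)), y)
    loss = w * sup_loss + (1 - w) * reg
    grads = torch.autograd.grad(loss, list(model.classifier.parameters()))
    with torch.no_grad():                                            # Eq. 8: theta_clf <- theta_clf - alpha * grad
        for p, g in zip(model.classifier.parameters(), grads):
            p -= lr * g
    return loss.item()
```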

Memory sample selection.

We dynamically update the memory buffer \(\mathcal {M}\) using reservoir sampling to ensure generalization while avoiding overfitting. Reservoir sampling (Riemer et al. 2019) randomly selects a fixed number of training samples without knowing the total number of samples in advance. We use it to select \(n_s\) examples per class from the incoming data stream \(\mathcal {D}^{label}_i \bigcup \mathcal {D}^{new}_i\) with an equal selection probability for all data seen so far. Note that the current label space \(\mathcal {Y}_i\) might overlap with previous label spaces; in that case, the memory samples for \(\mathcal {Y}_i\) are updated automatically to maintain a fixed number of memory samples per class.
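A per-class reservoir buffer matching this description can be sketched as follows (the class name is ours); each incoming example of class y is retained with probability \(n_s\) divided by the number of class-y examples seen so far:

```python
import random
from collections import defaultdict

class ClassBalancedReservoir:
    """Keeps at most n_s examples per class, each stream example having equal selection probability."""

    def __init__(self, n_s: int):
        self.n_s = n_s
        self.buffer = defaultdict(list)   # label -> stored examples
        self.seen = defaultdict(int)      # label -> number of class-y examples seen so far

    def add(self, x, y):
        self.seen[y] += 1
        if len(self.buffer[y]) < self.n_s:
            self.buffer[y].append(x)
        else:
            # Classic reservoir step: keep the new item with probability n_s / seen[y]
            j = random.randrange(self.seen[y])
            if j < self.n_s:
                self.buffer[y][j] = x

    def samples(self):
        return [(x, y) for y, xs in self.buffer.items() for x in xs]
```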

Algorithm 1

Continual active learning

5.1.2 Outer-loop optimization

The outer-loop algorithm optimizes the initial parameter set \(\theta\) towards a setting from which f can effectively adapt to \(\mathcal {D}_{1:t}\) via a few gradient updates. Due to memory constraints, we only retain a small number of samples from the past. The model reads all examples from the memory buffer \(\mathcal {M}\). The outer-loop objective is then to have \(\tilde{\theta } = \theta _{\textrm{e}} \cup \tilde{\theta }_{\textrm{clf}}\) generalize well on \(\mathcal {D}_{1:t}\) using \(\mathcal {M}\), as shown in Eq. 5,

$$\begin{aligned} \begin{aligned} \mathcal {L}^{\mathcal {M}}_{\textbf{meta}}(\tilde{\theta })&= \sum _{(x,y) \in \mathcal {M}} [\mathcal {L}_{CE} (f_{\tilde{\theta }}(x),y)] \\&= \sum _{(x,y) \in \mathcal {M}} [\mathcal {L}_{CE}(g_{\tilde{\theta }_{\textrm{clf}}} (h_{\theta _{\textrm{e}}}(x)),y)] \\ \end{aligned} \end{aligned}$$
(9)

To improve both intra- and inter-task generalization, we apply strong augmentation to the memory samples in the meta-objective as

$$\begin{aligned} \mathcal {L}^{\mathcal {M}}_{\textbf{meta}}(\tilde{\theta }) = \sum _{(x,y) \in \mathcal {M}} [\mathcal {L}_{CE}(g_{\tilde{\theta }_{\textrm{clf}}} (h_{\theta _{\textrm{e}}}(\mathcal {A}(x))),y)] \end{aligned}$$
(10)

To reduce the complexity of the second-order computation in the outer loop, we use a first-order approximation, namely FOMAML. The outer-loop optimization process is

$$\begin{aligned} \theta \leftarrow \theta - \beta \nabla _{\tilde{\theta }} \mathcal {L}^{\mathcal {M}}_{\textbf{meta}}(\tilde{\theta }), \end{aligned}$$
(11)

where \(\beta\) is the outer-loop learning rate. Algorithm 1 outlines the complete training procedure.
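Under the first-order approximation, the outer-loop update amounts to evaluating the cross-entropy on strongly augmented memory samples with the adapted parameters and applying the resulting gradient directly as the meta-gradient. A simplified sketch (in a full implementation the adapted classifier \(\tilde{\theta }_{\textrm{clf}}\) comes from the inner loop and the update is applied to the initial parameters; `strong_aug` is the same assumed callable as in the inner-loop sketch):

```python
import torch
import torch.nn.functional as F

def outer_loop_step(model, x_mem, y_mem, strong_aug, beta):
    """First-order meta update (Eqs. 10-11) on strongly augmented memory samples."""
    meta_loss = F.cross_entropy(model(strong_aug(x_mem)), y_mem)   # Eq. 10
    meta_loss.backward()                                           # FOMAML: gradient w.r.t. the adapted parameters
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= beta * p.grad                                 # Eq. 11: theta <- theta - beta * grad
                p.grad = None
    return meta_loss.item()
```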

5.2 Testing

The model randomly samples m batches of examples from \(\mathcal {M}\) as the support set S and performs SGD on these samples to finetune the parameter set \(\theta _{\textrm{clf}}\) with learning rate \(\alpha\). The inner-loop loss at test time is

$$\begin{aligned} \mathcal {L}_{\textbf{inner}}^{S}(\theta ) = \sum _{ (x,y) \in S}[\mathcal {L}_{CE}(f_{\theta }(x),y)], \end{aligned}$$
(12)

where \(S \subseteq \mathcal {M}\). The optimization process is

$$\begin{aligned} \tilde{\theta }_{\textrm{clf}} =\theta _{\textrm{clf}} - \alpha \nabla _{\theta } \mathcal {L}_{\textbf{inner}}^{S}(\theta ) \end{aligned}$$
(13)

Then, we output the prediction using the parameter set \(\tilde{\theta } = \theta _{\textrm{e}} \cup \tilde{\theta }_{\textrm{clf}}\) on a test sample \(x_{\textrm{test}}\) as

$$\begin{aligned} \hat{y}_{\textrm{test}} = f_{\tilde{\theta }}(x_{\textrm{test}})= g_{\tilde{\theta }_{\textrm{clf}}} (h_{\theta _{\textrm{e}}}(x_{\textrm{test}})) \end{aligned}$$
(14)

6 Experiments

6.1 Datasets

We use the text classification benchmark datasets from Zhang et al. (2015), including AGNews (news classification; 4 classes), Yelp (sentiment analysis; 5 classes), Amazon (sentiment analysis; 5 classes), DBpedia (Wikipedia article classification; 14 classes) and Yahoo (questions and answers categorization; 10 classes). This collection contains 5 tasks from 4 different domains, covering both class- and domain-incremental learning in the task sequence. We randomly sample 5 labeled instances per class, 10,000 unlabeled instances, and 7600 test examples from each dataset. Following prior studies (Wang et al. 2020; Holla et al. 2020; Ho et al. 2023), we concatenate the training sets in 4 different orderings, as shown in Table 1.

Table 1 Input dataset orders

6.2 Implementation details

Our example encoder is a pretrained \(\hbox {BERT}_{\textrm{BASE}}\) model (Devlin et al. 2019). The parameter size of our model is 109 M. The learning rates are \(\alpha = 10^{-3}\) and \(\beta = 3 \times 10^{-5}\). The training batch size is 16, and the number of mini-batches in each episode is m = 5. The label budget is \(B_{A}\) = 2000 examples per task and the memory budget \(B_\mathcal {M}\) is 5 samples per class, i.e., \(n_s = 5\). For textual augmentation, we randomly swap words as the weak augmentation, and apply a combination of randomly swapping words, randomly deleting words and substituting words with WordNet synonyms as the strong augmentation. We use nlpaug,Footnote 1 a Python package, to implement the augmentations. All models are executed on a Linux platform with 8 Nvidia Tesla A100 GPUs and 40 GB of RAM. All experiments are performed using PyTorchFootnote 2 (Paszke et al. 2019).
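The weak and strong textual augmentations can be illustrated with a dependency-free sketch; in practice we rely on nlpaug, and the WordNet synonym substitution of the strong augmentation is omitted here for brevity:

```python
import random

def weak_augment(text: str) -> str:
    """Weak augmentation alpha(.): randomly swap two neighbouring words."""
    words = text.split()
    if len(words) > 1:
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def strong_augment(text: str, p_delete: float = 0.1) -> str:
    """Strong augmentation A(.): random swap plus random word deletion.

    The WordNet synonym substitution used in our actual setup (via nlpaug) is omitted here.
    """
    words = [w for w in weak_augment(text).split() if random.random() > p_delete]
    return " ".join(words) if words else text
```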

6.3 Baselines

Baseline model:

  1. MAML-SEQ: Online FOMAML algorithm.

  2. OML-ER (Holla et al. 2020): OML with a 5% episodic experience replay rateFootnote 3 + reservoir sampling.

  3. C-MAMLFootnote 4 (Gupta et al. 2020): OML with Meta & CL objective alignment + reservoir sampling.

  4. Meta-CAL (ours): OML with Meta & CL objective alignment + consistency regularization + reservoir sampling.

  5. FULL: Supervised C-MAML, trained on the full datasets.

Memory sample selection:

  1. Prototype (Ho et al. 2023): Selects representative samples that are closest to dynamically updated prototypes in the representation space.

  2. Ring Buffer (Chaudhry et al. 2019): Uses a ’First-In, First-Out’ scheme to update the buffer.

  3. Reservoir Sampling (Riemer et al. 2019): Randomly selects data with an equal selection probability.

6.4 Evaluation metrics

Following prior work (Wang et al. 2024), we use three comprehensive and widely used metrics in CL, i.e., accuracy, backward transfer and forward transfer. Let \(R_{k,i}\) be the macro-averaged accuracy evaluated on the test set of the i-th task after sequentially learning k tasks.

Accuracy:

$$\begin{aligned} \textrm{ACC}_t = \frac{1}{t} \sum ^{t}_{i=1} R_{t,i} \end{aligned}$$

Overall accuracy is the weighted average accuracy of all seen tasks \(\mathcal {T}_{1:T}\).

Backward transfer (stability evaluation):

$$\begin{aligned} \textrm{BWT}_k = \frac{1}{k-1} \sum ^{k-1}_{i=1} (R_{k,i} - R_{i,i}) \end{aligned}$$

BWT measures how the updated parameters affect model performance on all previously seen tasks.

Forward transfer (plasticity evaluation):

$$\begin{aligned} \textrm{FWT}_k = \frac{1}{k-1} \sum ^{k}_{i=2} (R_{i,i} - R_{0,i}) \end{aligned}$$

FWT quantifies the average impact of all preceding tasks on the current task k.
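Given the accuracy matrix R (and \(R_{0,i}\), the accuracy on task i before training on it), the three metrics can be computed as in the following sketch:

```python
import numpy as np

def cl_metrics(R: np.ndarray, R0: np.ndarray):
    """ACC, BWT and FWT from an accuracy matrix.

    R[k, i] is the accuracy on task i after learning task k (0-indexed here);
    R0[i] is the accuracy on task i before any training on it.
    """
    T = R.shape[0]
    acc = R[T - 1].mean()                                            # ACC over all tasks after learning task T
    bwt = np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])     # backward transfer
    fwt = np.mean([R[i, i] - R0[i] for i in range(1, T)])            # forward transfer
    return acc, bwt, fwt
```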

6.5 Main results

In Table 2, we compare various models with four AL strategies. These strategies include Random (denoted as RAND), Representative via KMeans (denoted as REP), Diversity via KMeans (denoted as DIV), and Uncertainty via Least-confidence (denoted as UNC). The label budget \(B_{A}\) = 2000 examples per task. Each record is the average of three best results from five runs.

Table 2 Average accuracy of four training set orders

The results show that our method yields performance comparable to the FULL baseline, which is trained on more than 10,000 labeled samples per task. This indicates that our method can effectively select 2000 informative samples from 10,000 unlabeled data points to maximize model performance. Compared to the other baselines with the same MAML framework, our method obtains the highest average accuracy across the four commonly used AL strategies, demonstrating its robustness to different AL approaches. It also indicates that Meta-CAL is highly capable of preventing catastrophic forgetting after sequentially learning five tasks.

We also perform paired t-tests. Our model with the default setting is significantly better than C-MAML, with p-values < 0.04 for all four AL strategies. Comparing AL strategies for the proposed model, Random is significantly better than the other strategies with p-values < 0.03.

As for the memory sample selection schemes, Ring Buffer outperforms Reservoir Sampling by less than 1% with RAND and REP, whereas Reservoir Sampling exhibits smaller standard deviations, indicating robustness to training set orders. Prototype sampling performs worst. We provide further analysis in §7.2.

Table 3 Comparison of different AL strategies

Tables 2 and 3 demonstrate that Random is the best AL strategy for our model. As shown in Table 3, KMeans is the most effective approach for both the Diversity- and Representative-based AL strategies. Uncertainty-based methods perform comparatively poorly. CL models revisit past samples to enhance generalization; however, replaying and annotating uncertain samples may not sufficiently improve generalization, which hinders effective knowledge consolidation and consequently harms model performance.

7 Further analysis

We use the training set orderFootnote 5 Yelp \(\rightarrow\) AGNews \(\rightarrow\) DBpedia \(\rightarrow\) Amazon \(\rightarrow\) Yahoo to perform further analysis. In this section, we denote RAND, REP, DIV and UNC as Random, Representative (KMeans), Diversity (KMeans) and Uncertainty (Least-confidence), respectively.

7.1 Stability & plasticity

We test the performance of different AL strategies in terms of stability and plasticity.

Fig. 1

Per task accuracy at different learning stages. The dark color indicates high accuracy. From left to right, the color for each task progressively fades away, indicating forgetting happens while learning more tasks. Note that Yelp and Amazon are from the same domain (sentiment analysis). UNC shows a lighter color compared to other AL methods, indicating a higher degree of forgetting. The red-outlined box shows the accuracy on AGNews (Task 2) after learning DBpedia (Task 3). (Color figure online)

Fig. 2

BWT and FWT for different AL strategies at each learning stage. (Color figure online)

Stability: We employ the BWT metric as an indicator of catastrophic forgetting, evaluating the impact of the updated model parameters on the performance across all previously learned tasks. Negative BWT values indicate forgetting. As shown in Fig. 2, RAND shows the least forgetting, indicating the best stability on past tasks. However, all methods exhibit substantial forgetting after sequentially learning Task 3. Task 3 (DBpedia, 14 classes) has a relatively large label space compared to Task 1 (Yelp, 5 classes) and Task 2 (AGNews, 4 classes). As shown in Fig. 1, the accuracy on DBpedia exceeds the accuracy on the other tasks. Since MAML learns high-quality reusable features for fast adaptation (Raghu et al. 2020), it tends to find shared representations that specifically benefit tasks with large label spaces.

Plasticity: A positive FWT score indicates the model’s ability to leverage knowledge from previously learned tasks, facilitating zero-shot learning and efficient adaptation to the new task. As shown in Fig. 2, RAND, REP, and DIV show positive FWT values after training Task 4. This observation indicates that successful forward knowledge transfer from the previously learned tasks occurs. Figure 1 further supports that this transfer mainly occurs within the same domain, demonstrating effective domain-incremental learning capabilities. Specifically, the proposed method leverages prior knowledge acquired from a familiar domain to facilitate efficient adaptation to new tasks within that domain.

Overall, RAND shows the best stability while REP shows the best performance in plasticity.

7.2 Memory insight

The level of generalization can be inferred from data dispersion. In this section, we investigate the effect of memory samples resulting from active learning and memory sample selection strategies.

Fig. 3

T-SNE visualization of memory samples at different learning stages using training set order Yelp \(\rightarrow\) AGNews \(\rightarrow\) DBpedia \(\rightarrow\) Amazon \(\rightarrow\) Yahoo. The black-circled data points belong to the last task. Data points with darker colors represent samples from earlier tasks, except in (d), after learning Task 4. Task 4 is from the same domain as Task 1; hence, in (d) the data points with darker colors belong to the latest task. (Color figure online)

Different learning stages: Fig. 3 presents the T-SNE visualization of memory samples at different learning stages. As shown in Fig. 3a, b, when the proposed model has learned only a small number of tasks or encounters small label spaces, it focuses on ensuring intra-task generalization. However, as the number of seen tasks increases, the model shifts its focus towards ensuring inter-task generalization, as shown in Fig. 3d, e. Consequently, we observe a phenomenon wherein the memory samples from the last tasks cluster together.

Fig. 4

T-SNE visualization of memory samples with different AL strategies. An even dispersion of the data indicates a good memory representation. We provide accuracy for a better comparison. Data points with darker colors (purple, violet and pink) represent samples from earlier tasks. The black-circled data points belong to the last task. The clustering of memory samples from the last task suggests the model focuses more on inter-task generalization than on intra-task generalization. (Color figure online)

AL strategies: As shown in Fig. 4a–d, memory data in RAND show good dispersion within the last task (i.e., intra-task generalization) and across multiple tasks (i.e., inter-task generalization), resulting in the best accuracy. In contrast, UNC in Fig. 4d shows subpar inter-task generalization. This result confirms UNC's inability to achieve generalization and preserve knowledge. Both REP and DIV exhibit lower accuracy than RAND, with DIV showing clearly lower sparsity. Therefore, it is important to find an optimal balance between representativeness and diversity to ensure generalization.

Memory sample selection methods: In Fig. 4a, e and f, the choice of memory sample selection method also affects the model's performance. Prototype sampling selects representative memory samples, which can result in relatively low sparsity across multiple tasks. In contrast, the Reservoir and Ring Buffer sampling strategies introduce randomness into the memory sample selection process, achieving better inter-task generalization.

Therefore, the randomness introduced by AL and memory sample selection methods can be beneficial in consolidating knowledge by ensuring generalization. Furthermore, it is noteworthy that inter-task confusion does not occur in the learned embedding space. This validates the superiority of our method in addressing confusion between old and new tasks.

7.3 Effect of augmentations

We conduct an ablation study to analyze the textual augmentations and their key roles in Meta-CAL. Table 4 examines the effect of inner-loop and outer-loop augmentation. The meta samples in the outer loop constrain the model behaviour.

Table 4 Ablation study with different augmentation modules

We hypothesize that if the meta-samples exhibit sufficient generalization capabilities, they can facilitate effective knowledge retention from prior tasks, consequently mitigating the issue of catastrophic forgetting. It is noteworthy that the meta-samples are acquired through the combined effect of active learning acquisition and memory sample selection. We further enhance their generalization capabilities through data augmentation. The results validate this hypothesis, as we observe a significant gain from using outer-loop augmentation to enhance generalization. It improves accuracy by approximately 10% when data availability and the label budget are extremely limited.

While inner-loop augmentation proves effective in extreme cases, it might not be as advantageous as outer-loop augmentation in non-extreme scenarios. Nevertheless, the combination of inner- and outer-loop augmentations still significantly improves accuracy.

7.4 Annotation budgets

As shown in Fig. 5, in contrast to other AL strategies, increasing the annotation budget for UNC degrades model performance. This demonstrates that annotating uncertain samples is not beneficial for knowledge retention. REP achieves the best performance when the label budget is 500, while RAND outperforms the other methods in most cases. Consequently, a certain degree of representativeness can aid knowledge consolidation when the annotation budget is extremely limited. In addition, our model achieves more than 62% accuracy when the label budget is only 500 samples per task, suggesting its fast adaptation ability.

Fig. 5

Performance on different annotation budgets. (Color figure online)

7.5 Memory budgets

We evaluate the memory efficiency of our model. Table 5 compares the performance of the three models with the best average accuracy, i.e., our model with Reservoir sampling, our model with Ring Buffer and C-MAML. While all of these models use the MAML framework, only the Meta-CAL models employ consistency regularization to enhance generalization. The results show that the Meta-CAL models outperform C-MAML in all three cases. Notably, Meta-CAL w/ Reservoir attains more than 50% accuracy while saving only one sample per seen class. This also indicates that consistency regularization through data augmentation improves performance.

Table 5 Accuracy on different memory budgets. The AL strategy is RAND

8 Conclusion

This paper considers a realistic continual learning scenario, namely few-shot continual active learning. To address this resource-constrained continual learning problem, we propose a novel method, called Meta-Continual Active Learning, that employs meta-learning and active learning techniques. Specifically, our model dynamically queries worthwhile unlabeled data for annotation and reformulates the meta-objective with experience rehearsal and consistency regularization. In this way, the proposed method prevents catastrophic forgetting and improves generalization in a low-resource scenario. We conduct extensive experiments on benchmark text classification datasets in a 5-shot continual active learning setting, and the results show the robustness of the proposed method. However, we only evaluate our method in the 5-shot case. In future work, we plan to extend the evaluation to more realistic scenarios where the amount of labeled data is {100, 1000, 5000} and to other NLP tasks, e.g., language model training, text generation, and knowledge base enrichment. Furthermore, the annotation budget is currently allocated equally to each task; annotation budget allocation strategies can be further studied.