1 Introduction

Continual learning (CL) is a learning paradigm that aims to mimic the human ability to adapt to new environments without forgetting past experience (Delange et al., 2021; Peng et al., 2021; Kong et al., 2022; Wang et al., 2022b, c, d). However, when a network is trained on a sequence of tasks, its performance on old tasks drops significantly, a well-known challenge in continual learning referred to as catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990). The problem is closely tied to the stability-plasticity dilemma (Grossberg, 1982; Mermillod et al., 2013): with limited resources, it is infeasible for the network to simultaneously have the plasticity to learn a new task well and the stability to maintain useful knowledge learned from past tasks.

To alleviate the stability-plasticity dilemma, three classes of continual learning approaches have been proposed: replay-based methods, architecture-based methods, and regularization-based methods. In particular, replay-based methods store historical data in a limited coreset and replay them alongside new data (Rolnick et al., 2019; Isele & Cosgun, 2018; Chaudhry et al., 2019; Lopez-Paz & Ranzato, 2017), as done by Experience Replay (ER) (Riemer et al., 2018) and Dark Experience Replay (DER) (Buzzega et al., 2020). Architecture-based methods augment the network or allocate subnetworks for new tasks to reduce interference with past tasks (Zhou et al., 2012; Jerfel et al., 2019; Mallya & Lazebnik, 2018), e.g., Progressive Neural Networks (PNN) (Rusu et al., 2016). Regularization-based methods prevent important parameters from changing by adding a penalty term to the loss function (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018; Lee et al., 2017), like Elastic Weight Consolidation (EWC) (Schwarz et al., 2018).

However, most of these approaches are tailored to the offline setting, where the model can iterate over the entire dataset of each task multiple times. That is, the model can access the whole dataset of the current task at once and needs additional storage to hold it, which is often unrealistic. Therefore, we consider a more restrictive but practical setting, online continual learning (online CL). Specifically, online CL requires continual learning algorithms to observe the data of each task in a single pass while previous data are unavailable.

Fig. 1: The full procedure of the proposed method. The procedure includes the Standard-Process (S-P) and the Intra-Process (I-P), where S-P updates the model based on the current data and the coreset, and I-P updates it based on the coreset alone. The frequency of I-P depends on the confidence score \(\mathcal {S}\). When the average confidence score is higher than the threshold \(\epsilon _{\text {fre}}\), the frequency is gradually increased; otherwise, it is decreased. Low frequency refers to the situation where the number of S-P updates is much larger than the number of I-P updates within a certain time, and vice versa. The changes of the confidence score are only used for illustration

Considering the excellent performance of replay-based methods in continual learning, in this paper we focus on replay-based methods that store a subset of historical data in a coreset. The typical problem of replay-based methods is data imbalance: the data of the current task and the data of previous tasks are imbalanced due to the inaccessibility of old data and the small size of the coreset. Moreover, when applied in an online fashion, replay-based methods face further challenges that prevent them from achieving a good stability-plasticity trade-off. For example, the model observes the data stream from sequential tasks in a single pass, resulting in unsatisfactory learning of tasks (poor plasticity) and severe catastrophic forgetting (poor stability). In addition, the typical sampling strategy for online CL, reservoir sampling, draws a uniform random subset from the input stream and may omit representative and informative data of old tasks, resulting in more forgetting of previous tasks.

Therefore, to overcome these challenges, we propose a new online continual learning approach, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates based on a trust region. Specifically, the standard-process trains on data stored in a coreset interleaved with the current data, while the intra-process updates the network parameters based on the coreset alone. By triggering the intra-process during the standard-process, the model can improve the performance of tasks suffering from insufficient learning and alleviate forgetting of previous tasks simultaneously. Moreover, to mitigate data imbalance, the intra-process should be triggered more frequently at stages where the coreset is more balanced. We propose a trust-region-inspired approach (Nocedal and Wright, 2006; Conn et al., 2000; Cartis et al., 2011), measured by a confidence score, to detect such stages and adjust the frequency of the intra-process based on the trust region. During the intra-process, we further distill the dark knowledge to retain learned knowledge. Finally, considering the importance of which data are stored, we introduce a confidence-based coreset selection to store more representative samples and further alleviate forgetting. The full procedure of the proposed method is shown in Fig. 1.

The experimental results on different benchmarks demonstrate that TRAF outperforms existing competitive continual learning algorithms by a considerable margin. To summarize, our contributions are threefold:

  • We propose a new online CL method, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates based on a trust region to relieve the stability-plasticity dilemma.

  • To further improve performance, TRAF also uses confidence-based coreset selection to select more representative data.

  • Extensive experimental results on two standard protocols and several standard benchmarks show that the proposed method achieves state-of-the-art performance.

2 Related Work

2.1 Continual Learning Approaches

Replay-based methods are a prominent class of continual learning approaches and achieve state-of-the-art performance in many challenging scenarios. Specifically, replay-based methods maintain a small memory buffer to store data and train on the historical data interleaved with new data in later training iterations (Rolnick et al., 2019; Isele & Cosgun, 2018; Chaudhry et al., 2019; Shin et al., 2017; Rao et al., 2019; Aljundi et al., 2019a, 2017; Hou et al., 2018; Ostapenko et al., 2019; Bang et al., 2021). For instance, Experience Replay (ER) is the most classical approach, jointly optimizing the model parameters by replaying old data alongside new data. Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) and Averaged-GEM (AGEM) (Chaudhry et al., 2018b) update the model under inequality constraints computed from the gradients of the stored samples. Incremental Classifier and Representation Learning (iCaRL) (Rebuffi et al., 2017) learns in a class-incremental way by storing samples that are close to the center of each class. Rethinking-Experience Replay (RE-ER) (Buzzega et al., 2021) proposes several simple techniques to tackle the existing challenges of ER. Gradient based Sample Selection (GSS) (Aljundi et al., 2019b) focuses on the selection strategy and proposes a variation of ER from the view of constrained optimization. Meta-Experience Replay (MER) (Riemer et al., 2019) combines ER with optimization-based meta-learning to maximize transfer from past tasks while minimizing interference. DER++ (Buzzega et al., 2020) promotes consistency with the past by matching the network's outputs selected throughout the optimization trajectory. Unsupervised Continual Learning (UCL) (Madaan et al., 2022) mixes up new examples with past examples to mitigate forgetting. All of the above methods are applicable to, or easily modified for, online CL, except iCaRL and GEM.

Architecture-based methods expand the network progressively when needed or allocate different parameters to different tasks (Serra et al., 2018; Mallya & Lazebnik, 2018; Mallya et al., 2018; Li et al., 2019; Zhou et al., 2012; Wu et al., 2020; Yoon et al., 2019). For example, PNN (Rusu et al., 2016) expands the network when a new task arrives and retains the networks learned on past tasks. Wu et al. (2020) progressively and dynamically grow neural networks by jointly optimizing the network. However, these methods may result in a cumbersome and complex model if new tasks continually arrive.

Regularization-based methods add a penalty term to the loss function to prevent changes to important network parameters (Chaudhry et al., 2018a; Yin et al., 2020; Nguyen et al., 2017; Ritter et al., 2018; Lin et al., 2022). For example, EWC (Kirkpatrick et al., 2017), SI (Zenke et al., 2017), MAS (Aljundi et al., 2018), and ALASSO (Park et al., 2019) are devoted to computing the importance of parameters, while LwF (Li and Hoiem, 2017) aims to distill knowledge without storing old data. However, these approaches may lead to unsatisfactory performance without access to previous data, especially in challenging scenarios.

Fig. 2: The illustration of the online data stream, where the data of each task arrive sequentially and each datum can be observed only once

2.2 Online Continual Learning

While the majority of continual learning methods are designed for the offline scenario, where the model can iterate over the entire dataset of the current task multiple times (Zenke et al., 2017; Schwarz et al., 2018; Rusu et al., 2016; Rebuffi et al., 2017), online continual learning (online CL) has been gaining much interest recently due to its ubiquity in real-world problems. In this paper, we consider this more restrictive and challenging setting, i.e., online CL (Jin et al., 2021; Sun et al., 2022; Aljundi et al., 2019b). In online CL, the model observes the data of each task in a single pass and previous data are unavailable.

Moreover, recent works (Delange et al., 2021; Buzzega et al., 2020) have also outlined requirements that continual learning methods should satisfy to be more applicable in practice: (a) no task boundaries: the model does not rely on task boundaries; (b) constant memory: the memory is bounded throughout the entire training phase; (c) no test-time oracle: the task identities used to select the relevant classifier for each image are not accessible at inference time. Our setting follows these guidelines. According to whether the task identities are available at test time, we divide the scenario into two protocols, Task-Aware (with task identities) and Task-Free (without task identities) (Pham et al., 2021), and evaluate the proposed method on both.

3 Problem Setting

In this section, we present the setting of online continual learning, illustrated in Fig. 2. Formally, in online CL, the model is learned on a sequence of image classification tasks \(\mathcal {T} = \{\mathcal {T}_1,\ldots , \mathcal {T}_T\}\), where T is the total number of tasks and \(\mathcal {T}\) is the task set. For task \(\mathcal {T}_t\), the input samples x and the corresponding labels y are drawn i.i.d. from the distribution of task \(\mathcal {T}_t\). Let \(\mathcal {D}_t\) be the dataset of task \(\mathcal {T}_t\), and let \(\mathcal {D}\) be the corresponding online data stream consisting of all datasets \(\mathcal {D}_t, t \in \{1,2,\ldots ,T\}\), in sequence. Note that task boundaries are not provided to indicate the arrival of a new task during training. The model is trained on a sequence of batches \(\{\mathcal {B}_1, \mathcal {B}_2,\ldots \}\) from \(\mathcal {D}\), with each datum seen once.

Let \(\theta \) be the model parameters and \(\mathcal {N}\) the network. In this paper, we focus on replay-based methods, a prominent class of continual learning approaches, which store a subset of past data in a limited replay coreset \(\mathcal {C}\) and replay them in the future (Buzzega et al., 2020; Madaan et al., 2022; Buzzega et al., 2021). \(|\mathcal {A}|\) denotes the number of samples in a set \(\mathcal {A}\).
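To make the single-pass constraint concrete, the following is a minimal sketch of such a stream; the function name and the list-of-batches representation are our illustrative assumptions, not part of the paper's released code.

```python
def online_stream(task_datasets, batch_size=10):
    """Yield the online stream D: tasks arrive in order, each example is
    seen exactly once, and no task-boundary signal is emitted."""
    for dataset in task_datasets:                      # D_1, ..., D_T in sequence
        for start in range(0, len(dataset), batch_size):
            yield dataset[start:start + batch_size]    # one batch B_k
```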

4 Methodology

In this section, we describe the proposed method, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates at an adaptive frequency. We first describe the standard-process and intra-process (Sect. 4.1) and then introduce the trust-region adaptive frequency for the intra-process (Sect. 4.2), the key idea of our work. To further alleviate catastrophic forgetting, we also propose confidence-based coreset selection to select more representative data (Sect. 4.3). Finally, we discuss the differences between our work and some related works (Sect. 4.4).

Algorithm 1: The full training procedure of TRAF
Fig. 3: The illustration of distribution changes in the coreset during the training of a task. Assume that the class of the current task is Airplane. At the beginning of the current task, the samples in the coreset all come from old tasks; the fraction of samples from the current task then gradually increases until the end of the task, at which point the number of samples per class is almost equal

4.1 Standard-Process and Intra-Process

4.1.1 Standard-Process

In this subsection, we first introduce experience replay (ER), the most typical replay-based method, which stores a subset of historical data across encountered tasks and optimizes the network on the historical data together with the current data during training. Formally, when training on the current batch \(\mathcal {B}\), the objective can be written as:

$$\begin{aligned} \mathcal {L}(\mathcal {B}; \theta ) + \beta \mathcal {L}(\mathcal {C}; \theta ), \end{aligned}$$
(1)

where \(\mathcal {L}\) is the cross-entropy loss, \(\mathcal {C}\) is the coreset containing the stored training samples, \(\theta \) are the model parameters, and \(\beta \) is a factor that balances the new task against past tasks. We refer to the update in Eq. (1) as the Standard-Process (S-P).
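For concreteness, a minimal PyTorch sketch of one S-P step under Eq. (1) is given below; the `coreset.sample()` helper and all variable names are our assumptions for illustration, not the paper's implementation.

```python
import torch.nn.functional as F

def standard_process_step(net, optimizer, batch, coreset, beta):
    """One Standard-Process (S-P) update, Eq. (1): cross-entropy on the
    current batch plus beta-weighted cross-entropy on a replay batch."""
    x, y = batch
    x_c, y_c = coreset.sample()      # random mini-batch drawn from the coreset C
    loss = F.cross_entropy(net(x), y) + beta * F.cross_entropy(net(x_c), y_c)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```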

ER and its variants (Buzzega et al., 2020; Arani et al., 2022; Buzzega et al., 2021) have achieved impressive results in conventional continual learning. However, they still struggle to achieve a satisfactory stability-plasticity trade-off in online CL. Specifically, the model observes each batch \(\mathcal {B}\) sequentially and only once, resulting in insufficient learning of the current task (poor plasticity). Moreover, apart from the historical data stored in the coreset, the model can only access the data of the current task, leading it to pay more attention to the classes of the current task. This results in more severe catastrophic forgetting of previous classes (poor stability).

4.1.2 Intra-Process

Inspired by previous works showing that multiple iterations over the data can improve unsatisfactory performance (Tang et al., 2021), and to maximally utilize the limited data in the coreset, we introduce a new process, the Intra-Process (I-P), which updates the model parameters on the coreset. Formally, the loss function for the intra-process is \( \mathcal {L}(\mathcal {C}; \theta ), \) where \(\mathcal {L}\) is the cross-entropy loss and \(\mathcal {C}\) is the coreset containing the stored training samples.

To remedy the insufficient learning of tasks, we alternate the standard-process with the intra-process throughout the optimization trajectory. Specifically, we trigger the intra-process at a certain frequency during the training of the standard-process, where the frequency corresponds to the number of intra-process triggers within a given number of iterations. We define the trigger function as

$$\begin{aligned} \mathbb {I}(k, \textit{inv}) = \left\{ \begin{array}{rl} 1, &{} \quad k \quad \textit{mod} \quad \textit{inv} = 0, \\ 0, &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$
(2)

where \(\textit{mod}\) is the modulo operation, k is the current iteration number, and inv is an integer that controls the frequency; it is negatively related to the frequency, i.e., a larger inv corresponds to a lower frequency.
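In code, the trigger of Eq. (2) is a one-line check (a sketch; we assume iteration counting starts at 1):

```python
def trigger(k: int, inv: int) -> int:
    # Eq. (2): fire the intra-process every inv-th iteration
    return 1 if k % inv == 0 else 0
```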

4.2 Trust-Region Adaptive Frequency for Intra-Process

If the intra-process updates parameters on a more balanced coreset, it can better alleviate the negative impact of data imbalance. However, in the online setting, the class balance of the coreset varies across the learning stages of the current task. As found in our experiments and shown in Fig. 3, the class distribution in the coreset is more uniform over all observed classes at the late stage of learning the current task. Therefore, the intra-process should be triggered more frequently at the late stage of the current task. However, in online learning, task boundaries are not accessible, so we cannot directly determine the learning stage of the current task. To this end, we design an approach to detect the late stage of the current task. We observe that, because data can be seen only once in the online setting, the model's performance on the current task is better at the later training stage of that task. Therefore, we use the performance on the current task to detect its later stage. A natural way to measure performance is the confidence score, where a higher confidence score indicates better performance. Hence, to detect the stage, we propose a trust region measured by a confidence score and adjust the frequency dynamically based on the trust region, which we call trust-region adaptive frequency.

Specifically, to represent the region explicitly, we use the average confidence score, i.e., the predicted probability of the ground-truth label, of the current batch to measure the performance on the current task. When the model is inside the trust region, we increase the frequency by decreasing the factor inv, and we decrease the frequency when the model is outside the region. A higher score is trusted because it indicates better performance, a later stage of current task learning, and a more balanced coreset. Let \(\mathcal {S}(x; \theta )\) be the confidence score of x. Then the candidate value of \(\textit{inv}_{k}\) is updated by

$$\begin{aligned} \textit{inv}_{k}' = \left\{ \begin{array}{ll} \textit{inv}_{k-1}' - \delta , &{} \quad \text {avg}(\mathcal {S}(\mathcal {B}; \theta )) \ge \epsilon _{\text {fre}}, \\ \textit{inv}_{k-1}' + \delta , &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$
(3)

where avg(\(\cdot \)) denotes the average function, \(\epsilon _{\text {fre}}\) is the threshold of the score \(\text {avg}(\mathcal {S}(\mathcal {B}; \theta ))\), k is the current iteration number, \(\delta \) is the amplitude of the frequency update, and \(\mathcal {B}\) is the current batch. After obtaining \(\textit{inv}_{k}'\), we round it up or down to obtain the \(\textit{inv}_{k}\) used in the trigger function:

$$\begin{aligned} \textit{inv}_{k}= \left\{ \begin{array}{ll} \text {max}\{\lfloor \textit{inv}_{k}^{'} \rfloor , \textit{inv}_{\text {min}}\}, &{} \text {avg}(\mathcal {S}(\mathcal {B}; \theta )) \ge \epsilon _{\text {fre}} \\ \text {min}\{\lceil \textit{inv}_{k}^{'}\rceil , \textit{inv}_{\text {max}} \}, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(4)

where \(\lfloor \cdot \rfloor \) and \(\lceil \cdot \rceil \) denote the operations of rounding down and rounding up, respectively; \(\textit{inv}_{\text {max}}\) and \(\textit{inv}_{\text {min}}\) are the maximum and minimum values of \(inv_k\), respectively.
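A sketch of the update of Eqs. (3)-(4) is shown below, together with the confidence score \(\mathcal {S}\) (the predicted probability of the ground-truth label). The helper names are ours, and the continuous candidate \(\textit{inv}_k'\) is left unclamped, exactly as in the equations.

```python
import math
import torch
import torch.nn.functional as F

def confidence(net, x, y):
    """S(x; theta): predicted probability of the ground-truth label."""
    with torch.no_grad():
        probs = F.softmax(net(x), dim=1)
    return probs.gather(1, y.unsqueeze(1)).squeeze(1)

def update_inv(inv_prime, avg_score, eps_fre, delta, inv_min, inv_max):
    """Eqs. (3)-(4): shift the candidate inv' by delta, then round and clip."""
    if avg_score >= eps_fre:                       # inside the trust region
        inv_prime -= delta
        inv = max(math.floor(inv_prime), inv_min)  # round down, clip below
    else:                                          # outside the trust region
        inv_prime += delta
        inv = min(math.ceil(inv_prime), inv_max)   # round up, clip above
    return inv_prime, inv
```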

As shown in Fig. 1, when the average confidence score of the current batch is satisfactory, i.e., higher than the threshold \(\epsilon _{\text {fre}}\), we decrease inv and the corresponding frequency of the intra-process increases. Otherwise, we increase inv and the frequency decreases. Note that \(\epsilon _{\text {fre}}\) is an important factor because it determines the trust region. For example, when \(\epsilon _{\text {fre}}\) is large, the trust region only covers stages where the performance on the current task is better and the classes in the coreset are more balanced. However, in that case most of training falls outside the trust region, so the intra-process is triggered less often throughout the optimization trajectory, hurting the performance of the model. Moreover, \(\epsilon _{\text {fre}}\) is related to the complexity of the dataset. When the dataset is easy to learn, \(\epsilon _{\text {fre}}\) should be larger, since even the poorly learned stages are classified with high confidence.

To further relieve forgetting and maintain the useful knowledge learned from the past, we distill the dark knowledge (Buzzega et al., 2020; Gou et al., 2021; Zhao et al., 2021; Wang et al., 2020) during the intra-process, which we call Dark Knowledge Distillation (DKD). Specifically, we retain the network's logits and use a modified cross-entropy loss as the distillation loss. During the intra-process, we randomly sample exemplars \((x, \tilde{y}_\mathcal {C})\) from the coreset, where \(\tilde{y}_\mathcal {C}\) are the recorded logits of x. The distillation loss can then be written as:

$$\begin{aligned} \mathcal {L}_{d}(\tilde{y}_\mathcal {C}, \hat{y}_\mathcal {C}) = -\sum _{l=1}^{L} \tilde{y}_\mathcal {C}'^{(l)}\text {log} \hat{y}_\mathcal {C}'^{(l)}, \end{aligned}$$
(5)

where \(\tilde{y}_\mathcal {C}'^{(l)} = \frac{\text {exp}(\tilde{y}_\mathcal {C}^{(l)}/\tau )}{\sum _i \text {exp}(\tilde{y}_\mathcal {C}^{(i)}/\tau )}, \hat{y}_\mathcal {C}'^{(l)} = \frac{\text {exp}(\hat{y}_\mathcal {C}^{(l)} /\tau )}{\sum _i \text {exp}(\hat{y}_\mathcal {C}^{(i)}/\tau )}\), L is the total number of classes, \(\tau \) is the temperature factor, and \(\tilde{y}_\mathcal {C}\) and \(\hat{y}_\mathcal {C}\) are the recorded and current logits of x, respectively.
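A minimal sketch of the distillation loss in Eq. (5), assuming batched recorded and current logits as tensors of shape (batch, L):

```python
import torch.nn.functional as F

def dkd_loss(recorded_logits, current_logits, tau=1.0):
    """Eq. (5): soft cross-entropy between the temperature-scaled recorded
    logits (targets) and the network's current logits."""
    target = F.softmax(recorded_logits / tau, dim=1)       # \tilde{y}' in Eq. (5)
    log_pred = F.log_softmax(current_logits / tau, dim=1)  # log \hat{y}'
    return -(target * log_pred).sum(dim=1).mean()
```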

Overall, the training procedure can be represented as

$$\begin{aligned}&\text {Standard-Process:} \quad \mathcal {L}(\mathcal {B}; \theta ) + \beta \mathcal {L}(\mathcal {C}; \theta ), \end{aligned}$$
(6)
$$\begin{aligned}&\text {Intra-Process}: \quad \quad \mathcal {L}(\mathcal {C}; \theta ) + \lambda \mathcal {L}_d( \tilde{Y}_\mathcal {C}, \hat{Y}_\mathcal {C}), \end{aligned}$$
(7)

where \(\lambda \) is a factor that controls the importance of distillation; \(\tilde{Y}_\mathcal {C}\) and \(\hat{Y}_\mathcal {C}\) are the recorded and current logits of examples randomly sampled from the coreset \(\mathcal {C}\), respectively; \(\beta \) and \(\lambda \) are balancing hyperparameters commonly used in CL (Buzzega et al., 2020). The intra-process is triggered when \(\mathbb {I}(k, \textit{inv}_k) = 1\) (defined in Eq. 2). The procedure is shown in Algorithm 1.
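Putting Eqs. (2)-(7) together, the alternation might look as follows. This is a simplified sketch of Algorithm 1 that reuses the assumed helpers from the previous snippets (`standard_process_step`, `confidence`, `update_inv`, `dkd_loss`) and an assumed coreset interface that returns inputs, labels, and recorded logits.

```python
import torch.nn.functional as F

def traf_train(net, optimizer, stream, coreset, beta, lam, tau,
               eps_fre, delta, inv_min, inv_max):
    """Simplified sketch of the TRAF loop: S-P every iteration (Eq. 6),
    I-P with DKD (Eq. 7) whenever the trigger of Eq. (2) fires."""
    inv_prime, inv = float(inv_max), inv_max
    for k, (x, y) in enumerate(stream, start=1):
        standard_process_step(net, optimizer, (x, y), coreset, beta)
        score = confidence(net, x, y).mean().item()   # avg(S(B; theta))
        inv_prime, inv = update_inv(inv_prime, score, eps_fre,
                                    delta, inv_min, inv_max)
        if k % inv == 0:                              # Eq. (2): intra-process
            x_c, y_c, z_c = coreset.sample_with_logits()
            logits = net(x_c)
            loss = F.cross_entropy(logits, y_c) + lam * dkd_loss(z_c, logits, tau)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```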

4.3 Confidence-Based Coreset Selection

For replay-based methods, especially in online CL, a key problem is how to choose representative data that are beneficial for future rehearsal. A compatible selection strategy for online CL is reservoir sampling (Vitter, 1985), which randomly samples a uniform subset from the input stream. Specifically, reservoir sampling randomly chooses \(C=|\mathcal {C}|\) samples to store in the coreset \(\mathcal {C}\), guaranteeing that all seen samples have the same probability \(\frac{C}{N}\) of being stored, where N is the number of seen samples participating in the reservoir sampling strategy. The algorithm is shown in Algorithm 2, where \(randomInteger(\text {min}=0, \text {max}=N)\) denotes the operation that randomly selects an integer between 0 and \(N-1\).

Algorithm 2: Reservoir sampling
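For reference, a standard implementation of this reservoir update (a sketch matching the description above, with an assumed list-backed coreset) is:

```python
import random

def reservoir_update(coreset, example, n_seen, capacity):
    """Reservoir sampling (Vitter, 1985): after N seen samples, every
    sample remains in the coreset with equal probability C/N."""
    if len(coreset) < capacity:
        coreset.append(example)
    else:
        idx = random.randrange(n_seen)   # randomInteger(min=0, max=N)
        if idx < capacity:
            coreset[idx] = example
```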

However, reservoir sampling assigns equal importance to all samples and does not take representativeness into consideration. Therefore, we design a simple but effective sampling strategy, Confidence-based Coreset Selection (CCS), that stores more representative data by keeping samples with higher confidence scores in an online manner. CCS relies on the confidence score to select samples. However, at the early stage of learning each task, the confidence scores are unreliable because the model does not yet fit the current task well. Therefore, we select samples based on the confidence score only when it is reliable, i.e., when the average confidence score is higher than a threshold; otherwise, we select samples randomly to avoid the negative effect of unreliable confidence scores. Formally, the indexes \({\textbf {idx}}^*\) of the selected data for the current batch \(\mathcal {B}\) can be formulated as

$$\begin{aligned} {\textbf {idx}}^* {=} \left\{ \begin{array}{ll} \mathop {\text {argmax}^{(m)}}\limits _{n} \mathcal {S}(\mathcal {B}; \theta ), n \in [|\mathcal {B}|], &{} \text {avg}(\mathcal {S}(\mathcal {B}; \theta )) {\ge } \epsilon _{\text {ccs}}, \\ \mathop {\textit{random}}(m, |\mathcal {B}|), &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(8)

where \(\mathop {\text {argmax}^{(m)}}\limits _{n} \) selects the m indexes of the examples with the top-m confidence scores from \(n \in \{1, 2,\ldots , |\mathcal {B}|\}\), \(\mathcal {B}\) is the current batch, \(m = p \times |\mathcal {B}|\) with \(p \in (0, 1]\) the ratio of selected indexes, and \(\mathop {\textit{random}}(m, |\mathcal {B}|)\) is a function that randomly selects m numbers from \(\{1, 2,\ldots , |\mathcal {B}|\}\) without replacement. \(\epsilon _\text {ccs}\) is a factor that determines when the confidence scores are reliable enough to identify representative samples.
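A sketch of Eq. (8) in PyTorch, assuming `scores` holds \(\mathcal {S}(\mathcal {B}; \theta )\) for the current batch:

```python
import torch

def ccs_select(scores, p, eps_ccs):
    """Eq. (8): top-m confidence indexes when the scores are reliable,
    a uniform random subset of size m otherwise."""
    m = max(1, int(p * scores.numel()))
    if scores.mean().item() >= eps_ccs:           # scores are trustworthy
        return torch.topk(scores, m).indices      # argmax^{(m)}
    return torch.randperm(scores.numel())[:m]     # random, without replacement
```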

Therefore, the coreset \(\mathcal {C}\) is updated as follows:

$$\begin{aligned} \mathcal {C} \leftarrow \textit{reservoir}(\mathcal {C}, \mathcal {B}[{\textbf {idx*}}], \hat{Y}_{\mathcal {B}}[{\textbf {idx*}}]), \end{aligned}$$
(9)

where \(\textit{reservoir}\) denotes the reservoir sampling operation, \({\textbf {idx}}^*\) is obtained from Eq. (8), and \(\hat{Y}_{\mathcal {B}}\) are the corresponding logits of the current batch. The full algorithm is shown in Algorithm 1.

4.4 Discussion

Our work is related to Liu et al. (2021) and Hou et al. (2019), but differs from them in several aspects. First, our method does not rely on an oracle of task boundaries, i.e., knowing the end of a task, to obtain a balanced coreset. Unlike Liu et al. (2021) and Hou et al. (2019), which rely on task boundaries to obtain balanced coresets, our method does not construct balanced coresets directly but uses the confidence score to detect the training stage and judge the balance of the coreset. Second, both the intra-process and the standard-process update the network parameters and use neither additional parameters nor fixed parameters. For example, Liu et al. (2021) uses additional neuron-level scaling weights and aggregation weights. Third, our method alternates the standard-process and intra-process at a dynamic frequency, whereas Hou et al. (2019) applies class-balance finetuning only at the end of each task (phase) and Liu et al. (2021) alternates the two optimization processes at every iteration.

5 Experiments

In this section, we first describe the experimental setup and implementation. Then, we evaluate the continual learning algorithms on two protocols: Task-Aware and Task-Free. We also conduct ablation studies to explore the effect of different factors and show more results.

5.1 Experimental Setup and Implementation

Settings Based on whether the task identities are provided to select the relevant classifier for each image during testing, online CL can be divided into two protocols (Pham et al., 2021): Task-Aware and Task-Free, where the latter is more challenging because task identities are unavailable at inference time.

Benchmarks Following previous works (Buzzega et al., 2020; Madaan et al., 2022; Buzzega et al., 2021), we evaluate the algorithms on four standard continual learning benchmarks: Split MNIST (S-MNIST), Split CIFAR-10 (S-CIFAR-10), Split CIFAR-100 (S-CIFAR-100), and Split TinyImageNet (S-TinyImageNet). Split MNIST and Split CIFAR-10 split the training examples of MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al., 2009) into five tasks, respectively, each with two disjoint classes. Split CIFAR-100 consists of 20 tasks, each of which introduces 5 of the 100 classes of CIFAR-100 (Krizhevsky et al., 2009) without replacement. TinyImageNet (Stanford, 2015) consists of 100,000 64 \(\times \) 64 color training images and 10,000 validation images. Similarly, Split TinyImageNet is constructed as 10 sequential tasks divided from TinyImageNet, each with 20 disjoint classes out of the total 200 classes without replacement.

Architectures Adhering to previous works (Buzzega et al., 2020; Mirzadeh et al., 2020; Jin et al., 2021), for Split MNIST we employ a two-layer fully connected network, where each hidden layer has 100 ReLU units. For the variants of CIFAR-10 and CIFAR-100, we employ a lightweight ResNet-18 that is three times smaller than the standard ResNet-18. For Split TinyImageNet, we use the standard ResNet-18 (He et al., 2016). All tasks share the same classifier, i.e., we use the more challenging single-head setting.

Baselines We compare TRAF with the following 10 methods: ER, MER (Riemer et al., 2019), AGEM (Chaudhry et al., 2018b), GSS (Aljundi et al., 2019b), UCL (Madaan et al., 2022), DER (Buzzega et al., 2020), RE-ER (Buzzega et al., 2021), DER++ (Buzzega et al., 2020), Complementary Learning System-ER (CLS-ER) (Arani et al., 2022), and Continual Normalization (CN) (Pham et al., 2022). We also provide the performance of SGD (Ghadimi and Lan, 2013), which simply trains the model without any countermeasure against forgetting.

Fig. 4: The curves of the average accuracy after the network has been trained on each task of Split CIFAR-100 and Split TinyImageNet, averaged over five runs. [\(\uparrow \)] denotes higher is better

Evaluation Metric Following previous works (Mirzadeh et al., 2020; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018b), we evaluate continual learning algorithms with two metrics: Average Accuracy (ACC) and Forgetting (FT). Formally, after the model has finished learning all tasks, ACC is the average accuracy over all observed tasks, \( \text {ACC} = \frac{1}{T}\sum _{i = 1}^{T} a_{Ti},\) where \(a_{ti}\) is the accuracy on task \(\mathcal {T}_{i}\) after the model has been trained on task \(\mathcal {T}_{t}\). FT measures the performance degradation from each task's peak performance to its final performance, i.e., \(\text {FT} = \frac{1}{T-1}\sum _{i=1}^{T-1}\text {max}_{t \in [T-1]} (a_{ti} - a_{Ti}) \). Higher ACC and lower FT are better; with similar ACC, the algorithm with lower FT is better.
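Both metrics can be computed from the accuracy matrix; a small sketch follows, assuming a NumPy array `a` with `a[t, i]` the accuracy on task i after training on task t (0-indexed, rows filled sequentially).

```python
import numpy as np

def acc_and_forgetting(a):
    """ACC and FT as defined above; `a` has shape (T, T)."""
    T = a.shape[0]
    acc = a[T - 1].mean()                    # mean final accuracy over all tasks
    peak = a[:T - 1, :T - 1].max(axis=0)     # max_t a_{ti} for t in [T-1], i < T
    ft = (peak - a[T - 1, :T - 1]).mean()    # mean drop from peak to final
    return acc, ft
```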

Implementation Details We use PyTorch to implement the proposed algorithm and all other experiments. We use the SGD optimizer and a batch size of 10 for all experiments. Adhering to previous work (Buzzega et al., 2020), the coreset sizes of Split MNIST and Split CIFAR-10 are 200 and 500, respectively; for Split CIFAR-100 and Split TinyImageNet, the coreset size is 1000. The learning rate for all experiments is 0.03. For the method-related hyperparameters of all baselines, e.g., \(\alpha \) in DER++, we follow the settings of the released code. CN is used on top of Experience Replay (ER). For the proposed method, the batch size for the intra-process is 50 for 5-Split MNIST and 5-Split CIFAR-10, and 100 for the other datasets. The sample ratio p is 0.9 for 5-Split MNIST and 5-Split CIFAR-10, and 0.8 for the other datasets. We set \(\tau \) to 2 for Split MNIST and 1 for the other datasets. The \(inv_{\text {max}}\) for all datasets is 5; the \(inv_{\text {min}}\) is 1 for Split CIFAR-10 and 2 for the other datasets. Other hyperparameter settings are shown in Table 1. The loss function is the cross-entropy loss. We run all experiments five times with different random seeds and report averaged results. We use the reservoir sampling strategy (Vitter, 1985) for all baselines applicable to online CL.

Table 1 The hyperparameter settings
Table 2 Comparisons of constant frequency and adaptive frequency
Table 3 Results of ACC (%) [\(\uparrow \)] and forgetting (%) [\(\downarrow \)] evaluated on all tasks after finishing learning all tasks on the setting of Task-Free
Table 4 Results of ACC (%) [\(\uparrow \)] and forgetting (%) [\(\downarrow \)] evaluated on all tasks after finishing learning all tasks on the setting of Task-Aware
Table 5 Results of ACC (%) [\(\uparrow \)] evaluated on all tasks after finishing learning all tasks on the setting of Task-Aware (T-Aware) and Task-Free (T-Free)

5.2 Experimental Results

The effect of trust-region adaptive frequency We first assess the ACC and FT of constant frequency versus adaptive frequency on the Task-Free setting, a more restrictive and challenging scenario. For a fair comparison, we exclude the CCS and DKD components. According to Table 2, adaptive frequency achieves higher ACC (relative improvement of at least 4.38%) and lower FT than any constant frequency (any integer in the range [2, 5]), validating that trust-region adaptive frequency achieves better performance with less catastrophic forgetting.

Table 6 Results of ACC (%) [\(\uparrow \)] evaluated on all tasks after finishing learning all tasks on the setting of Task-Free in online continual learning. 2 Split, 5 Split, 10 Split, and 20 Split divide all 100 classes of CIFAR-100 into 2, 5, 10, and 20 splits, respectively

Comparisons with baselines on the setting of Task-Free Table 3 summarizes the ACC and FT results on the Task-Free protocol. According to Table 3, the proposed method outperforms the baselines by a considerable margin. For example, the ACC of TRAF is at least 1.0% higher than that of the other methods on all benchmarks. In particular, on Split CIFAR-100, the ACC of TRAF achieves at least a 3% improvement over the SOTA (CLS-ER). One reason for the weaker performance of other methods may be that they are not designed for this realistic and challenging scenario. For instance, UCL requires multiple passes over the datasets to learn features invariant across tasks, while the insufficient learning stemming from the online setting prevents it from learning such representations; hence its performance in the online scenario is poorer than ours. Moreover, Table 3 shows that the proposed method has lower or comparable forgetting relative to the baselines, demonstrating the effectiveness of TRAF in alleviating forgetting. For example, on Split CIFAR-10, the FT of TRAF is at least 2.00% lower than the other methods. On Split TinyImageNet, the forgetting of AGEM is lower than ours, but its ACC is significantly lower than that of our method (10% lower).

Comparisons with baselines on the setting of Task-Aware Table 4 shows the ACC and FT results on the Task-Aware protocol. Similar to the Task-Free setting, the proposed method achieves higher accuracy with comparable forgetting relative to the other methods. For instance, on Split CIFAR-100, the ACC of TRAF is 75.00%, substantially higher than the best baseline, i.e., 70.47%. On Split TinyImageNet, the forgetting (FT) of TRAF is better than all other methods except UCL; however, the ACC of UCL is significantly lower than ours (by over 30%). This is because UCL uses an unsupervised learning loss to train the network and requires sufficient learning to form good representations, so it performs poorly in the online fashion.

The accuracy curves Fig. 4 shows the curves of average accuracy over the observed tasks after the network has been trained on each task of Split CIFAR-100 and Split TinyImageNet, respectively. According to Fig. 4, the performance of our method is consistently higher than the baselines, further validating its superiority in alleviating the stability-plasticity dilemma.

Table 7 Results of ACC (%) and FT (%)

Combining with more CL methods As shown in Table 5, when combined with our method, existing CL methods can sometimes exceed the performance of TRAF alone. For example, SI+TRAF obtains higher performance on S-MNIST (Task-Free) and S-TinyImageNet (Task-Aware).

Comparison with more recent works According to Table 6, compared to DER’ (Yan et al., 2021) and FOSTER (Wang et al., 2022a), our proposed method achieves the best performance by a significant margin. Moreover, according to Table 7, CCS achieves better performance and lower forgetting than Rainbow Memory (RM), validating the superiority of CCS in selecting data.

Table 8 Effect of each component
Table 9 Ablation studies for t-test
Fig. 5: The effect of \(\beta \) and \(\lambda \)

5.3 Ablation Study and Analysis

Effect of each component We show the effect of each component in Table 8. According to Table 8, both adding I-P and adding CCS obtain higher ACC and lower FT. In particular, on S-CIFAR-10 the ACC of adding I-P is 8.16% higher and the FT is 12.23% lower than S-P alone, indicating the effectiveness of trust-region adaptive frequency. Similarly, adding confidence-based coreset selection achieves a better stability-plasticity trade-off and less forgetting on all benchmarks; for example, on S-CIFAR-10, adding CCS improves accuracy by 1.16% and forgetting by 2.21%. Moreover, combining all components further improves performance and decreases forgetting.

Fig. 6: The effect of \(\epsilon _{\text {fre}}\) and \(\epsilon _{\text {ccs}}\) in trust-region adaptive frequency

Effect of \(\beta \) and \(\lambda \) Fig. 5 shows that either too large or too small a value of the balance factor \(\beta \) results in poor performance. If \(\beta \) is too large, the model pays too much attention to preventing forgetting, resulting in unsatisfactory learning of the current task; if \(\beta \) is too small, the model cannot retain past knowledge well, resulting in catastrophic forgetting. Similarly, too large or too small a value of the distillation factor \(\lambda \) results in an improper balance between learning from the coreset and knowledge distillation, leading to a worse stability-plasticity trade-off.

Table 10 Results of ACC (%) evaluated on all tasks after finishing learning all tasks. Higher is better
Table 11 Ablation studies on rules

The effect of \(\epsilon _{\text {fre}}\) and \(\epsilon _{\text {ccs}}\) As shown in Fig. 6, when \(\epsilon _{\text {fre}}\) is close to zero, the interval stays near \(inv_{\text {min}}\), so the intra-process is performed frequently at the beginning of each task, when the data in the coreset come mostly from old tasks, leading to worse performance. Conversely, if \(\epsilon _{\text {fre}}\) is large, the interval stays near \(inv_{\text {max}}\) and the frequency is low, so the model cannot exploit the advantage of the intra-process. Therefore, \(\epsilon _{\text {fre}}\) must be set properly to achieve better performance. We also explore the effect of \(\epsilon _{\text {ccs}}\) in Fig. 6. When \(\epsilon _{\text {ccs}}=1.0\), the model always selects data from the current batch randomly, the selection strategy degenerates into reservoir sampling, and the performance is worse. However, when \(\epsilon _{\text {ccs}}=0.0\), the model always selects the examples with the highest confidence scores, and the performance is also unsatisfactory because the confidence scores are unreliable when the model has not yet learned well.

Comparison with t-test Table 9 shows the results of using a t-test rule in Eq. (3). The experiments were run five times. According to Table 9, using the average score is better than the t-test, validating the superiority of the proposed rule.

The result of combining into one loss Table 10 shows the results of a baseline (OneLoss) that combines the two processes into one loss and optimizes it at every iteration. The results show that our method performs better than OneLoss, validating that alternating between the standard-process and the intra-process is essential.

Ablation studies on rules The results in Table 11 show that using the average score is better than all other strategies, validating the reasonableness of using the average operation.

5.4 Selected Data

Figure 7 shows a comparison of randomly selected data and the data chosen by the proposed confidence-based coreset selection. The data selected by CCS are clearer and their true class is easier to distinguish, i.e., they are more representative, validating the effectiveness of our selection strategy.

5.5 Running Time

Figure 8 shows a comparison of running time. The device is a single Nvidia Tesla V100 (16GB) GPU, the dataset is Split CIFAR-100, and the results are averaged over five runs. According to Fig. 8, since our proposed method alternates the intra-process with the standard-process, its running time is marginally larger than that of some baselines. However, the running time of the proposed method is also lower than that of some methods, e.g., GSS.

Fig. 7: The comparison of data selected by the random strategy and by Confidence-based Coreset Selection (CCS)

Fig. 8: Comparison of running time (s). The device is a single Nvidia Tesla V100 (16GB) GPU. The dataset is Split CIFAR-100 and the results are averaged over five runs. For a better comparison, we only show the range [0, 800]

Fig. 9: The changes of \(inv_k\) during training on Split CIFAR-100. The smaller \(inv_k\), the higher the frequency of intra-process

5.6 The Changes of \(inv_k\) During Training

Figure 9 shows how \(inv_k\) evolves over time. The smaller \(inv_k\), the higher the frequency of the intra-process. Since the batch size is 10 and each task has 2500 samples, a new task arrives every 250 iterations. According to Fig. 9, \(inv_k\) increases (i.e., the frequency decreases) at the beginning of each task. At the later stage of training each task, the coreset stores more data of the current task and, as shown in Fig. 3, becomes more balanced; therefore \(inv_k\) decreases, i.e., the frequency of the intra-process increases. The result shows that the frequency of the intra-process is adjusted dynamically based on the learning stage, validating the effect of our method.

6 Conclusion

In this paper, we aim to relieve the stability-plasticity dilemma for continual learning under the constraint that data arrive in an online stream. We propose a new online continual learning approach, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates at an adaptive frequency. Moreover, TRAF retains useful knowledge through dark knowledge distillation and stores representative data based on confidence scores. Extensive experimental results validate the effectiveness of the proposed method on several benchmarks. As a limitation, the proposed method is tailored to the online setting and may not excel in other continual learning settings, e.g., the offline setting. In the future, we would like to explore more realistic and challenging scenarios and provide more theoretical analysis. Moreover, studying other continual learning methods, e.g., regularization-based methods, in the online setting is also an interesting direction.