Trust-Region Adaptive Frequency for Online Continual Learning

In online continual learning, a single neural network is exposed to a sequence of tasks whose data arrive online, and previously seen data are no longer accessible. This online setting causes insufficient learning of the current task and severe forgetting of past tasks, which prevents a good stability-plasticity trade-off: ideally, the network should be plastic enough to adapt to new tasks and stable enough to avoid forgetting old ones. To address these issues, we propose a trust-region adaptive frequency approach that alternates between standard-process and intra-process updates. Specifically, the standard-process replays data stored in a coreset interleaved with the current data, and the intra-process updates the network parameters based on the coreset. Furthermore, to improve the unsatisfactory performance caused by the online setting, the frequency of the intra-process is adjusted based on a trust region, which is measured by the confidence score of the current data. During the intra-process, we distill dark knowledge to retain useful learned knowledge. Moreover, to store more representative data in the coreset, we present a confidence-based coreset selection that operates online. Experimental results on standard benchmarks show that the proposed method significantly outperforms state-of-the-art continual learning algorithms.


Introduction
Continual learning (CL) is a learning paradigm that aims to mimic the human ability to adapt to new environments without forgetting past experience (Delange et al., 2021; Peng et al., 2021; Kong et al., 2022; Wang et al., 2022b, c, d). However, when a network is trained on a sequence of tasks, its performance on old tasks drops significantly, a phenomenon referred to as catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990) and a well-known challenge in continual learning. The problem is closely tied to the stability-plasticity dilemma (Grossberg, 1982; Mermillod et al., 2013): with limited resources, it is infeasible for a network to have both the plasticity to learn a new task well and the stability to maintain useful knowledge learned from past tasks.
However, most approaches are tailored to the offline setting, where the model can iterate over the entire dataset of each task multiple times. That is, the model can access the whole dataset of the current task at once and needs additional storage to hold it, which is not realistic. Therefore, we consider a more restrictive but practical setting, online continual learning (online CL). Specifically, online CL requires continual learning algorithms to observe the data of each task in a single pass while previous data are unavailable.
Considering the excellent performance of replay-based methods in continual learning, in this paper we focus on a replay-based method that stores a subset of historical data in a coreset. A typical problem of replay-based methods is data imbalance: the data of the current task and the data of previous tasks are imbalanced because old data are inaccessible and the coreset is small. Moreover, in the online setting, replay-based methods face further challenges that prevent them from achieving a good stability-plasticity trade-off. For example, the model observes the data stream from sequential tasks in a single pass only, resulting in unsatisfactory learning of tasks (poor plasticity) and severe catastrophic forgetting (poor stability). In addition, the typical sampling strategy for online CL, reservoir sampling, randomly samples a uniform subset from the input stream and may omit representative and informative data of old tasks, resulting in more forgetting of previous tasks. To overcome these challenges, we propose a new online continual learning approach, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates based on a trust region. Specifically, the standard-process trains on data stored in a coreset interleaved with the current data, and the intra-process updates the network parameters based on the coreset. By triggering the intra-process during the standard-process, the model can improve the task performance that suffers from insufficient learning and alleviate the forgetting of previous tasks at the same time. Moreover, to alleviate the data imbalance, the intra-process should be triggered more frequently in the stage where the coreset is more balanced.
We propose a trust-region-inspired approach (Nocedal and Wright, 2006; Conn et al., 2000; Cartis et al., 2011), measured by the confidence score, to detect this stage and adjust the frequency of the intra-process based on the trust region. During the intra-process, we further distill dark knowledge to retain learned knowledge. Finally, considering the importance of which data are stored, we introduce a confidence-based coreset selection that stores more representative samples to further alleviate forgetting. The full procedure of the proposed method is shown in Fig. 1.
The experimental results on different benchmarks demonstrate that TRAF outperforms existing competitive continual learning algorithms by a considerable margin. To summarize, our contributions are threefold:
- For online CL, we propose a new method, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates based on a trust region to relieve the stability-plasticity dilemma.
- To further improve performance, TRAF also uses confidence-based coreset selection to select more representative data.
- Extensive experimental results on two standard protocols and several standard benchmarks show that the proposed method achieves state-of-the-art performance.

Continual Learning Approaches
Replay-based methods are a prominent class of continual learning approaches and achieve state-of-the-art performance in many challenging scenarios. Specifically, replay-based methods maintain a small memory buffer to store data and train on the historical data interleaved with the new data in later training iterations (Rolnick et al., 2019; Isele & Cosgun, 2018; Chaudhry et al., 2019; Shin et al., 2017; Rao et al., 2019; Aljundi et al., 2019a, 2017; Hou et al., 2018; Ostapenko et al., 2019; Bang et al., 2021). For instance, Experience Replay (ER) is the most classical approach, which jointly optimizes the model parameters by replaying old data alongside new data. Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) and Averaged-GEM (AGEM) (Chaudhry et al., 2018b) update the model under inequality constraints on the gradients, which are computed from the gradients of the stored samples. Incremental Classifier and Representation Learning (iCaRL) (Rebuffi et al., 2017) learns in a class-incremental way by storing samples that are close to the center of each class. Rethinking-Experience Replay (RE-ER) (Buzzega et al., 2021) proposes several simple techniques to tackle the existing challenges in ER. Gradient based Sample Selection (GSS) (Aljundi et al., 2019b) focuses on the selection strategy and proposes a variation on ER from the view of constrained optimization. Meta-Experience Replay (MER) (Riemer et al., 2019) combines ER with optimization-based meta-learning to maximize transfer from past tasks while minimizing interference. DER++ (Buzzega et al., 2020) promotes consistency with the past by matching the network's logits recorded earlier in training.
Fig. 1 The full procedure of the proposed method. The procedure includes Standard-Process (S-P) and Intra-Process (I-P), where S-P updates the model based on the current data and the coreset, and I-P updates it based on the coreset. The frequency of I-P depends on the confidence score S. When the average confidence score is higher than the threshold fre, the frequency is gradually increased. Otherwise, the frequency is decreased. Low frequency refers to the situation where, within a given period, standard-process updates far outnumber I-P updates, and vice versa.
Architecture-based methods expand the network progressively when needed or allocate different parameters to different tasks (Serra et al., 2018; Li et al., 2019; Zhou et al., 2012; Wu et al., 2020; Yoon et al., 2019). For example, PNN (Rusu et al., 2016) expands the network when a new task arrives and retains the networks learned on past tasks. Wu et al. (2020) progressively and dynamically grow neural networks by jointly optimizing the network. However, these methods may result in a cumbersome and complex model if new tasks keep arriving.
Regularization-based methods add a penalty term to the loss function to limit changes to the network parameters (Chaudhry et al., 2018a; Yin et al., 2020; Nguyen et al., 2017; Ritter et al., 2018; Lin et al., 2022). For example, EWC (Kirkpatrick et al., 2017), SI (Zenke et al., 2017), MAS (Aljundi et al., 2018), and ALASSO (Park et al., 2019) focus on computing the importance of parameters, while LwF (Li and Hoiem, 2017) aims to distill knowledge without storing old data. However, these approaches may lead to unsatisfactory performance without access to previous data, especially in challenging scenarios.

Online Continual Learning
While the majority of continual learning methods are designed for offline scenarios, where the model can iterate over the entire dataset of the current task multiple times (Zenke et al., 2017; Schwarz et al., 2018; Rusu et al., 2016; Rebuffi et al., 2017), online continual learning (online CL) has recently been gaining much interest due to its ubiquity in many real-world problems. In this paper, we consider this more restrictive and challenging setting, i.e., online CL (Jin et al., 2021; Sun et al., 2022; Aljundi et al., 2019b). In online CL, the model observes the data of each task in a single pass and previous data are unavailable.
Moreover, recent works (Delange et al., 2021; Buzzega et al., 2020) also list requirements that continual learning methods should satisfy to be more applicable in practice: (a) no task boundaries: the model does not rely on task boundaries; (b) constant memory: the memory is bounded throughout the entire training phase; (c) no test-time oracle: the task identities, which are used to select the relevant task for each image, are not accessible at inference time. Our setting follows these guidelines. Depending on whether the task identities (test-time oracle) are available, we divide the scenario into two protocols, Task-Aware (with task identities) and Task-Free (without task identities) (Pham et al., 2021), and evaluate the proposed method on both protocols.
Fig. 2 The illustration of the online data stream, where the data of each task arrive sequentially and each sample can be observed only once

Problem Setting
In this section, we present the setting of online continual learning. Figure 2 illustrates online CL. Formally, in online CL, the model is trained on a sequence of image classification tasks $\mathcal{T} = \{\mathcal{T}_1, \dots, \mathcal{T}_T\}$, where $T$ is the total number of tasks and $\mathcal{T}$ is the task set. For task $\mathcal{T}_t$, the input samples $x$ and the corresponding labels $y$ are drawn independently and identically from the distribution of task $\mathcal{T}_t$. Let $D_t$ be the dataset of task $\mathcal{T}_t$, and let $D$ be the corresponding online data stream consisting of all datasets $D_t$, $t \in \{1, 2, \dots, T\}$, in sequence. Note that no task boundaries are provided to indicate the arrival of a new task during training. The model is trained on a sequence of batches $\{B_1, B_2, \dots\}$ from $D$, with each sample seen once.
Let θ be the model parameters and N be the network. In this paper, we focus on replay-based methods, a prominent class of approaches in continual learning, which store a subset of past data in a limited replay coreset C and replay the data in the future (Buzzega et al., 2020; Madaan et al., 2022; Buzzega et al., 2021). |A| denotes the number of samples in a set A.

Methodology
In this section, we describe the proposed method, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates at an adaptive frequency. We first describe the standard-process and intra-process (Sect. 4.1) and then introduce the trust-region adaptive frequency for intra-process (Sect. 4.2), the key idea of our work. To further alleviate catastrophic forgetting, we also propose confidence-based coreset selection to select more representative data (Sect. 4.3). Finally, we discuss the difference between our work and some related works (Sect. 4.4).

Algorithm 1 Trust-Region Adaptive Frequency in Online Continual Learning
Input: Network N, parameters θ, data stream D, learning rate η, scalars m, λ, δ, fre, inv_min, inv_max, ccs, inv_0 and β

Standard-Process
In this subsection, we first introduce experience replay (ER), the most typical replay-based method, which stores a subset of historical data across encountered tasks and optimizes the network with both the historical data and the current data during training. Formally, when training on the current batch B, the objective can be written as
$$L(B; \theta) + \beta\, L(C; \theta), \tag{1}$$
where L is the cross-entropy loss, C is the coreset containing the stored training samples, θ are the model parameters, and β is a factor that controls the balance between the new task and past tasks. We refer to the update in Eq. (1) as the Standard-Process (S-P). ER and its variants (Buzzega et al., 2020; Arani et al., 2022; Buzzega et al., 2021) have achieved impressive results in conventional continual learning. However, they still struggle to achieve a satisfactory stability-plasticity trade-off in online CL. Specifically, the model observes the batches B sequentially and sees each batch only once, resulting in insufficient learning of the current task (poor plasticity). Moreover, apart from the historical data stored in the coreset, the model can only access the data of the current task, so it pays more attention to the classes of the current task. This results in more severe catastrophic forgetting of previous classes (poor stability).

Intra-Process
Inspired by previous work showing that multiple iterations over the data can improve unsatisfactory performance (Tang et al., 2021), and to make maximal use of the limited data in the coreset, we introduce a new process, the Intra-Process (I-P), which updates the model parameters using the coreset. Formally, the loss function for the intra-process is L(C; θ), where L is the cross-entropy loss and C is the coreset containing the stored training samples.
To improve the insufficient learning of tasks, we alternate the standard-process with the intra-process throughout the optimization trajectory. Specifically, we trigger the intra-process at a certain frequency during standard-process training, where the frequency corresponds to the number of intra-process triggers within a given number of iterations. We define the trigger function as
$$I(k, inv) = \begin{cases} 1, & \text{if } k \bmod inv = 0, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$
where mod is the modulo operation, k is the current iteration number, and inv is an integer that controls the frequency and is negatively related to it, i.e., a larger inv corresponds to a lower frequency.
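The trigger rule described above (run the intra-process whenever the iteration count is a multiple of inv) can be sketched in a few lines. The function name `trigger` and the choice to start counting iterations at k = 1 are illustrative assumptions, not details taken from the paper.

```python
def trigger(k: int, inv: int) -> int:
    """Return 1 when the intra-process should run at iteration k, else 0."""
    if inv < 1:
        raise ValueError("inv must be a positive integer")
    return 1 if k % inv == 0 else 0


# Iterations (counted from 1, an assumption) at which the intra-process fires.
fired_inv2 = [k for k in range(1, 11) if trigger(k, 2)]
fired_inv5 = [k for k in range(1, 11) if trigger(k, 5)]
```

With the smaller interval the intra-process fires on every second iteration, while the larger interval fires it only twice in the same ten iterations, which matches the inverse relation between inv and the frequency.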

Trust-Region Adaptive Frequency for Intra-Process
If the intra-process updates parameters based on a more balanced coreset, it can alleviate the negative impact of data imbalance. However, due to the online setting, the class balance in the coreset varies across the learning stages of the current task. As found in our experiments and shown in Fig. 3, the class distribution in the coreset is more uniform over all observed classes at the late stage of learning the current task. Therefore, the intra-process should be triggered more frequently at the late stage of the current task to better alleviate the negative impact of data imbalance. However, in online learning, the task boundaries are not accessible, so we cannot directly obtain the learning stage of the current task. To this end, we design an approach to detect the late stage of the current task. We find that, because data can be seen only once in the online setting, the performance of the model on the current task is better at the later training stage of that task. Therefore, we use the performance on the current task to detect its late stage. A natural way to measure performance is the confidence score, where a higher confidence score represents better performance. Therefore, to detect the stage, we propose a trust region that is measured by a confidence score and adjust the frequency dynamically based on the trust region, which we call trust-region adaptive frequency.
Fig. 3 The illustration of distribution changes in the coreset during the training of a task. Assume that the class of the current task is Airplane. At the beginning of learning the current task, the samples in the coreset are all from old tasks; the fraction of samples from the current task then gradually increases until the end of the task. Finally, the numbers of samples of each class are almost equal
Specifically, to represent the region explicitly, we use the average confidence score of the current batch, where the confidence score is the predicted probability of the ground-truth label, to measure the performance on the current task. When the model is inside the trust region, we increase the frequency by decreasing the factor inv, and we decrease the frequency when the model is outside the region. A higher score is trusted because it indicates better performance, a later stage of learning the current task, and a more balanced coreset. Let S(x; θ) be the confidence score of x. The candidate value of $inv_k$ is then updated by Eq. (3), where avg(·) denotes the average function, fre is the threshold on the score avg(S(B; θ)), k is the current iteration number, δ is the amplitude of the frequency update, and B is the current batch. After obtaining the candidate, we round it down or up to obtain the $inv_k$ used in the trigger function, where $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$ denote rounding down and rounding up, respectively, and $inv_{\max}$ and $inv_{\min}$ are the maximum and minimum values of $inv_k$, respectively. As shown in Fig. 1, when the average confidence score of the current batch is satisfactory, i.e., higher than the threshold fre, we decrease inv and the frequency of the intra-process increases accordingly. Otherwise, we increase inv and the frequency decreases. Note that fre is an important factor because it determines the trust region. For example, when fre is large, the intra-process is triggered only when the performance on the current task is high and the classes in the coreset are more balanced. In that case, however, most of the trajectory lies outside the trust region, so the intra-process is triggered rarely throughout the optimization trajectory, which harms the performance of the model. Moreover, fre is related to the complexity of the dataset. When the dataset is easy to learn, fre should take a larger value since even the worst cases can be classified well.
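As a rough illustration of the adaptive rule, the sketch below keeps a real-valued candidate interval, moves it by δ depending on whether the batch-average confidence exceeds fre, and derives the integer interval used by the trigger. Several details are assumptions made only for this example: the update is additive, the candidate is clipped to [inv_min, inv_max], and the candidate is rounded down inside the trust region and up outside it. The function name `update_interval` and the numeric values are hypothetical.

```python
import math


def update_interval(candidate, avg_score, fre, delta, inv_min, inv_max):
    """One step of an adaptive-interval rule (illustrative assumptions only).

    candidate : real-valued interval carried over from the previous iteration
    avg_score : average confidence of the current batch, avg(S(B; theta))
    Returns (new_candidate, inv_k), where inv_k is the integer interval.
    """
    trusted = avg_score > fre
    # Assumption: additive update of amplitude delta.
    candidate = candidate - delta if trusted else candidate + delta
    # Assumption: the candidate is kept inside [inv_min, inv_max].
    candidate = min(max(candidate, inv_min), inv_max)
    # Assumption: round down inside the trust region, round up outside it.
    inv_k = math.floor(candidate) if trusted else math.ceil(candidate)
    return candidate, int(inv_k)


# Hypothetical run: confidence rises during a task, then drops at a task switch.
candidate, history = 5.0, []
for score in [0.2, 0.7, 0.8, 0.9, 0.9, 0.3]:
    candidate, inv_k = update_interval(candidate, score, fre=0.6, delta=0.5,
                                       inv_min=2, inv_max=5)
    history.append(inv_k)
```

In this toy run the interval shrinks while the confidence stays above the threshold and grows again once the confidence falls, so the intra-process becomes more frequent late in a task and less frequent right after a switch.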
Algorithm 2 reservoir(C, x, y)
To further relieve forgetting and maintain the useful knowledge learned from the past, we distill the dark knowledge (Buzzega et al., 2020; Gou et al., 2021; Zhao et al., 2021; Wang et al., 2020) during the intra-process, which we call Dark Knowledge Distillation (DKD). Specifically, we retain the network's logits and use a modified cross-entropy loss as the distillation loss. During the intra-process, we randomly sample exemplars $(x, \tilde{y}_C)$ from the coreset, where $\tilde{y}_C$ is the recorded logits of x. The distillation loss is then computed between $\tilde{y}_C$ and $\hat{y}_C$, where L is the total number of classes, τ is the temperature factor, and $\tilde{y}_C$ and $\hat{y}_C$ are the recorded and current logits of x, respectively.
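To make the distillation step concrete, the sketch below uses one common form of temperature-softened cross-entropy between recorded and current logits. This form is an assumption chosen because it fits the description (a modified cross-entropy over all classes with a temperature τ); the exact loss used by TRAF may differ, and the function names are illustrative.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(recorded_logits, current_logits, tau=1.0):
    """Cross-entropy between softened recorded outputs and current outputs."""
    target = softmax(recorded_logits, tau)
    pred = softmax(current_logits, tau)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


# Hypothetical logits for a three-class problem.
recorded = [2.0, 0.5, -1.0]
loss_same = distillation_loss(recorded, recorded, tau=2.0)
loss_diff = distillation_loss(recorded, [-1.0, 0.5, 2.0], tau=2.0)
```

The loss is smallest when the current logits reproduce the recorded ones and grows as the current prediction drifts away, which is the behaviour needed to retain previously learned knowledge.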
To this end, the training procedure consists of the Standard-Process and the Intra-Process updates, where λ is a factor that controls the importance of distillation; $Y_C$ and $\hat{Y}_C$ are the recorded and current logits of examples randomly sampled from the coreset C, respectively; and β and λ are balancing hyperparameters commonly used in CL (Buzzega et al., 2020). The intra-process is performed when I(k, inv) = 1 (defined in Eq. 2). The procedure is shown in Algorithm 1.

Confidence-Based Coreset Selection
For replay-based methods, especially in online CL, a key problem is how to choose representative data that are beneficial for future rehearsal. A selection strategy compatible with online CL is reservoir sampling (Vitter, 1985), which randomly samples a uniform subset from the input stream. Specifically, reservoir randomly chooses |C| samples to store in the coreset C, guaranteeing that all seen samples have the same probability |C|/N of being stored in the coreset, where N is the number of seen samples participating in the reservoir sampling strategy. The reservoir algorithm is shown in Algorithm 2, where randomInteger(min = 0, max = N) denotes the operation that randomly selects an integer between 0 and N − 1.
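For reference, standard reservoir sampling can be sketched as follows. This is the textbook procedure rather than a transcription of Algorithm 2; the argument names (`capacity`, `n_seen`, `rng`) are illustrative, and `n_seen` counts the samples offered before the current one, so the random integer is drawn from 0 to N − 1 with N = n_seen + 1.

```python
import random


def reservoir(coreset, capacity, n_seen, sample, rng=random):
    """Offer `sample` to the coreset and return the updated seen-count.

    While the coreset is not full the sample is appended; afterwards it
    replaces a random slot with probability capacity / (n_seen + 1).
    """
    if len(coreset) < capacity:
        coreset.append(sample)
    else:
        j = rng.randrange(n_seen + 1)  # integer in [0, n_seen]
        if j < capacity:
            coreset[j] = sample
    return n_seen + 1


# Stream 1000 hypothetical samples through a coreset of size 10.
rng = random.Random(0)
coreset, n_seen = [], 0
for x in range(1000):
    n_seen = reservoir(coreset, 10, n_seen, x, rng)
```

After the stream has been consumed, every sample has had the same chance of ending up in the coreset, regardless of when it arrived.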
However, reservoir places equal importance on all samples and does not take data representativeness into consideration. Therefore, we design a simple but effective sampling strategy that stores more representative data, called Confidence-based Coreset Selection (CCS), by storing data with higher confidence scores in an online manner. CCS relies on the confidence score to select samples. However, at the early stage of learning each task, the confidence scores are unreliable because the model does not yet fit the current task well. Therefore, we select samples based on the confidence score only when the score is reliable, i.e., when the average confidence score is higher than a threshold. Otherwise, we select samples randomly to avoid the negative effect of unreliable confidence scores. Formally, the indices idx* of the selected data for the current batch B are given by Eq. (8): they are chosen by confidence score when the average confidence score exceeds the threshold, and otherwise drawn randomly from {1, ..., |B|} without replacement. ccs is a factor that determines when the representative samples are convincing. The coreset C is then updated with the reservoir operation, where reservoir denotes reservoir sampling, idx* is obtained from Eq. (8), and $\hat{Y}_B$ are the corresponding logits of the current batch. The full algorithm is shown in Algorithm 1.
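The selection rule can be illustrated with the following sketch. It assumes that a fixed fraction p of the batch is kept, that the most confident samples are taken when the batch-average confidence exceeds ccs, and that a uniform draw without replacement is used otherwise. The function name `select_indices`, the interpretation of p as the kept fraction, and the zero-based indices are assumptions made for this example.

```python
import random


def select_indices(confidences, ccs, p, rng=random):
    """Pick indices of the current batch to offer to the coreset.

    confidences : predicted probability of the true label for each sample
    ccs         : reliability threshold on the batch-average confidence
    p           : fraction of the batch to keep (assumed meaning)
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("empty batch")
    n_keep = min(n, max(1, round(p * n)))
    avg = sum(confidences) / n
    if avg > ccs:
        # Scores are considered reliable: keep the most confident samples.
        order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
        return sorted(order[:n_keep])
    # Scores are unreliable: uniform draw without replacement.
    return sorted(rng.sample(range(n), n_keep))


# Hypothetical batch of five confidence scores.
conf = [0.9, 0.2, 0.8, 0.95, 0.6]
reliable_pick = select_indices(conf, ccs=0.5, p=0.6)
fallback_pick = select_indices(conf, ccs=0.9, p=0.6, rng=random.Random(0))
```

With a threshold of 1.0 the rule always falls back to random selection, and with a threshold of 0.0 it always selects by confidence, which mirrors the two extreme cases examined in the ablation on ccs.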

Discussion
Our work is related to Liu et al. (2021) and Hou et al. (2019). However, our method differs from theirs in several aspects. First, our method does not rely on a task-boundary oracle, i.e., knowing when a task ends, to obtain a balanced coreset. Unlike Liu et al. (2021) and Hou et al. (2019), which rely on task boundaries to obtain a balanced coreset, our method does not obtain balanced coresets directly but uses the confidence score to detect the training stage and judge the balance of the coreset. Second, both the intra-process and the standard-process update the network parameters, and neither uses additional parameters or fixes parameters. In contrast, Liu et al. (2021) use additional neuron-level scaling weights and aggregation weights. Third, our method alternates between the standard-process and the intra-process at a dynamic frequency, whereas Hou et al. (2019) apply class-balanced fine-tuning at the end of each task (phase) and Liu et al. (2021) alternate the two optimization processes at each iteration.

Experiments
In this section, we first describe the experimental setup and implementation. Then, we evaluate the continual learning algorithms on two protocols: Task-Aware and Task-Free. We also conduct ablation studies to explore the effect of different factors and show more results.

Experimental Setup and Implementation
Settings Depending on whether the task identities are provided to select the relevant classifier for each image during testing, online CL can be divided into two protocols (Pham et al., 2021): Task-Aware and Task-Free, where the latter is more challenging because the task identities are unavailable at inference time.
Architectures Following previous works (Buzzega et al., 2020; Mirzadeh et al., 2020; Jin et al., 2021), for Split MNIST we employ a two-layer fully connected network, where each hidden layer has 100 ReLU units. For the variants of CIFAR-10 and CIFAR-100, we employ a lightweight ResNet-18 that is three times smaller than the standard ResNet-18. For Split TinyImageNet, we use the standard ResNet-18 (He et al., 2016). All tasks share the same classifier, i.e., we use the more challenging single-head setting.
The compared baselines include Continual Normalization (CN) (Pham et al., 2022). We also report the performance of SGD (Ghadimi and Lan, 2013), which simply trains the model without any countermeasure to forgetting.
Evaluation Metric Following previous works (Mirzadeh et al., 2020; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018b), we evaluate continual learning algorithms with two metrics: Average Accuracy (ACC) and Forgetting (FT). Formally, after the model has finished learning all tasks, ACC is the average accuracy evaluated across all observed tasks, defined as $\mathrm{ACC} = \frac{1}{T}\sum_{i=1}^{T} a_{T,i}$, where $a_{t,i}$ is the accuracy on task $\mathcal{T}_i$ after the model has been trained on task $\mathcal{T}_t$. FT measures the performance degradation of each task from its peak performance to its final performance, i.e., $\mathrm{FT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\max_{t\in\{1,\dots,T-1\}}\left(a_{t,i} - a_{T,i}\right)$. Higher ACC and lower FT are better. With similar ACC, the algorithm with lower FT is better.
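As an illustration, both metrics can be computed from a matrix of accuracies. The sketch assumes the commonly used forgetting definition (the average, over the first T − 1 tasks, of the peak accuracy reached before the final task minus the final accuracy); the function names, the zero-based indexing, and the accuracy values are hypothetical.

```python
def average_accuracy(acc):
    """acc[t][i]: accuracy on task i after training on task t (0-indexed)."""
    T = len(acc)
    return sum(acc[T - 1][i] for i in range(T)) / T


def forgetting(acc):
    """Average drop from peak accuracy (before the last task) to final accuracy."""
    T = len(acc)
    if T < 2:
        return 0.0
    drops = []
    for i in range(T - 1):
        peak = max(acc[t][i] for t in range(T - 1))
        drops.append(peak - acc[T - 1][i])
    return sum(drops) / (T - 1)


# Hypothetical accuracies for three tasks; row t is measured after task t.
acc = [
    [0.9, 0.0, 0.0],
    [0.7, 0.8, 0.0],
    [0.6, 0.7, 0.9],
]
acc_value = average_accuracy(acc)
ft_value = forgetting(acc)
```

In this example the first task loses 0.3 and the second loses 0.1 relative to their peaks, so the forgetting is their mean, while ACC averages the last row of the matrix.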

Implementation Details
We use PyTorch to implement the proposed algorithm and other experiments. We use the SGD optimizer and a batch size of 10 for all experiments. Following previous work (Buzzega et al., 2020), the coreset sizes for Split MNIST and Split CIFAR-10 are 200 and 500, respectively. For Split CIFAR-100 and Split TinyImageNet, the coreset size is 1000. The learning rate for all experiments is 0.03. For the method-related hyperparameters of all baselines, e.g., α in DER++, we use the settings of the released code. CN is used on top of Experience Replay (ER). For the proposed method, the batch size for the intra-process is 50 for 5-Split MNIST and 5-Split CIFAR-10, and 100 for the other datasets. The sample ratio p is 0.9 for 5-Split MNIST and 5-Split CIFAR-10, and 0.8 for the other datasets. We set τ to 2 for Split MNIST and 1 for the other datasets. The inv_max for all datasets is 5, and the inv_min is 1 for Split CIFAR-10 and 2 for the other datasets. Other hyperparameter settings are shown in Table 1. The loss function is the cross-entropy loss. We perform all experiments five times with different random seeds and report the average over the five runs. We use the reservoir sampling strategy (Vitter, 1985) for all baselines applicable to online CL.

Experimental Results
The effect of trust-region adaptive frequency We first assess the ACCs and FTs of constant frequency and adaptive frequency in the Task-Free setting, a more restrictive and challenging scenario. For a fair comparison, we exclude the CCS and DKD components. According to Table 2, using adaptive frequency achieves higher ACCs (a relative improvement of at least 4.38%) and lower FTs than using a constant frequency (any integer in the range [2, 5]), validating that trust-region adaptive frequency achieves better performance and less catastrophic forgetting.

Comparisons with baselines on the setting of Task-Free
Table 3 summarizes the ACC and FT results under the Task-Free protocol. According to Table 3, the proposed method outperforms the baselines by a considerable margin. For example, the ACCs of TRAF are at least 1.0% higher than those of other methods on all benchmarks. In particular, on Split CIFAR-100, the ACC of TRAF is at least 3% higher than that of the SOTA (CLS-ER). One reason for the worse performance of other methods may be that previous works are not suited to realistic and challenging scenarios. For instance, UCL requires multiple accesses to the datasets to learn invariant features between all tasks, while the insufficient learning stemming from the online setting prevents it from learning such representations. Therefore, when applied to the online scenario, its performance is poorer than that of our method. Moreover, Table 3 shows that the proposed method has lower or comparable forgetting relative to the baselines, demonstrating the effectiveness of TRAF in alleviating forgetting. For example, on Split CIFAR-10, the FT of TRAF is at least 2.00% lower than that of other methods. On Split TinyImageNet, the forgetting of AGEM is lower than that of our method. However, the ACC of AGEM is significantly lower than ours (10% lower).
The bold values denote the best performance. '-' indicates experiments we were unable to run, due to compatibility issues or intractable training time. [↑] Higher is better and [↓] lower is better.

Comparisons with baselines on the setting of Task-Aware
Table 4 shows the ACC and FT results under the Task-Aware protocol. Similar to the Task-Free setting, the proposed method achieves higher performance with comparable forgetting relative to other methods. For instance, on Split CIFAR-100, the ACC of TRAF is 75.00%, considerably higher than the best baseline performance, i.e., 70.47%. On Split TinyImageNet, the forgetting (FT) of TRAF is better than that of other methods, except for UCL. However, the ACC of UCL is significantly lower than ours (by over 30%). This is because UCL uses an unsupervised learning loss to train the network and requires sufficient training to learn the representations well. Thus, it performs poorly in the online setting.
The accuracy curves Fig. 4 shows the curves of average accuracy evaluated on the observed tasks after the network has been trained on each task, on Split CIFAR-100 and Split TinyImageNet, respectively. According to Fig. 4, the performance of our method is consistently higher than that of the baselines, further validating the superiority of the proposed method in alleviating the stability-plasticity dilemma.
Combining with more CL methods As shown in Table 5, when other CL methods are combined with our method, the combined methods can sometimes outperform TRAF alone. For example, SI+TRAF obtains higher performance on S-MNIST (Task-Free) and S-TinyImageNet (Task-Aware).
Comparison with more recent works According to Table 6, compared to DER' (Yan et al., 2021) and FOSTER (Wang et al., 2022a), our proposed method achieves the best performance by a significant margin. Moreover, according to Table 7, CCS achieves better performance and lower forgetting than rainbow memory (RM), validating the superiority of CCS in selecting data.
The classes of each task are disjoint. The memory size is 1000.
The experiments run three times. The memory sizes for Split CIFAR-10 and Split CIFAR-100 are 500 and 1000, respectively. 5-Split CIFAR-10 and 5-Split CIFAR-100 divide the total classes of CIFAR-10 and CIFAR-100 into five tasks, respectively.
The experiments are average results of five runs on the setting of Task-Free. "I-P/D" denotes intra-process without DKD.

Ablation Study and Analysis
Effect of each component We show the effect of each component. According to Table 8, adding either I-P or CCS yields higher ACCs and lower FTs. In particular, on S-CIFAR-10, adding I-P gives an ACC that is 8.16% higher and an FT that is 12.23% lower than S-P alone, indicating the effectiveness of trust-region adaptive frequency. Similarly, adding confidence-based coreset selection achieves a better stability-plasticity trade-off and less forgetting on all benchmarks. For example, on S-CIFAR-10, adding CCS improves accuracy and forgetting by 1.16% and 2.21%, respectively. Moreover, combining all components further improves the performance and decreases forgetting.
Effect of β and λ Fig. 5 shows that both too large and too small a value of the balance factor β result in poor performance. If β is too large, the model pays too much attention to preventing forgetting, resulting in unsatisfactory learning of the current task. If β is too small, the model cannot retain past knowledge well, resulting in catastrophic forgetting. Similarly, too large or too small a value of the distillation factor λ results in an improper balance between learning from the coreset and knowledge distillation, leading to a worse stability-plasticity trade-off.
The effect of fre and ccs As shown in Fig. 6, when fre is close to zero, the interval is almost inv_min, so the intra-process is performed frequently at the beginning of learning each task, when the data in the coreset are mostly from old tasks, leading to worse performance. Conversely, if fre is large, the interval is almost inv_max and the frequency is low, so the model cannot exploit the advantage of the intra-process. Therefore, fre needs to be chosen properly to achieve better performance. We also explore the effect of ccs in Fig. 6. When ccs = 1.0, the model randomly selects data from the current batch; the selection strategy degenerates into reservoir sampling, and the performance is worse. However, when ccs = 0.0, the model always selects the examples with higher confidence scores, and the performance is also unsatisfactory because the confidence scores are not reliable when the model has not yet learned well.
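The two extremes of ccs described above can be illustrated with a small sketch. It assumes that ccs acts as the probability of a random pick from the current batch, which is only one reading consistent with the two extremes and is not necessarily the selection rule used in this paper. The function name, its parameters, and the toy confidence scores are hypothetical.

```python
import random
from typing import List, Sequence


def select_for_coreset(
    confidences: Sequence[float], k: int, ccs: float, rng: random.Random
) -> List[int]:
    """Return indices of up to k examples chosen from the current batch.

    Assumption (not from the paper): each pick is random with probability
    ccs and otherwise takes the remaining example with the highest
    confidence score.
    """
    if not 0.0 <= ccs <= 1.0:
        raise ValueError("ccs must lie in [0, 1]")
    remaining = list(range(len(confidences)))
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        if rng.random() < ccs:
            # Random pick: with ccs = 1.0 every pick takes this branch.
            idx = rng.choice(remaining)
        else:
            # Confidence-based pick: with ccs = 0.0 every pick takes this branch.
            idx = max(remaining, key=lambda i: confidences[i])
        remaining.remove(idx)
        chosen.append(idx)
    return chosen


# Toy confidence scores for a batch of four examples (made-up numbers).
scores = [0.2, 0.9, 0.5, 0.7]
print(select_for_coreset(scores, k=2, ccs=0.0, rng=random.Random(0)))
print(select_for_coreset(scores, k=2, ccs=1.0, rng=random.Random(0)))
```

Intermediate values of ccs mix the two behaviours, which matches the observation that neither extreme performs best.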
Comparison with t-test Table 9 shows the results of using the t-test rule in Eq. (3). The experiments are run five times. According to Table 9, using the average score is better than using the t-test, validating the superiority of the proposed rule.
The result of combining into one loss Table 10 shows the results of the baseline (OneLoss) that combines the two processes into one loss and optimizes it at every iteration. The results show that our method performs better than OneLoss, validating that alternating between the standard-process and the intra-process is essential.
Ablation studies on rules The results in Table 11 show that using the average score is better than all other strategies, validating the choice of the average operation. The bold values denote the best performance.
Figure 7 compares randomly selected data with the data chosen by the proposed confidence-based coreset selection. The data selected by CCS are bolder and make the true class easier to distinguish, i.e., they are more representative, validating the effectiveness of our selection strategy.
Figure 8 compares the running time. The device is a single Nvidia Tesla V100 (16GB) GPU. The dataset is Split CIFAR-100, and the results are averaged over five runs. According to Fig. 8, since our proposed method alternates between the intra-process and the standard-process, its running time is marginally larger than that of some baselines. However, it is still lower than that of some methods, e.g., GSS.
The changes of inv_k during training. The dataset is Split CIFAR-100. The smaller inv_k is, the higher the frequency of the intra-process.
Figure 9 shows how inv_k evolves over time.

The Changes of inv_k During Training
Since the batch size is 10 and each task contains 2500 examples, a new task arrives every 250 iterations. According to Fig. 9, inv_k increases at the beginning of each task, i.e., the intra-process is performed less frequently. At the later stage of training on each task, the coreset stores more data of the current task and, as shown in Fig. 3, becomes more balanced. Therefore, inv_k decreases, i.e., the frequency of the intra-process increases. The result shows that the frequency of the intra-process can be adjusted dynamically based on the learning stage, validating the effect of our method.
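The arithmetic above, and the relation between the interval inv_k and the frequency of the intra-process, can be made concrete with a short sketch. It assumes that an intra-process update is triggered once the number of iterations since the previous one reaches the current interval; this is a simplified reading of "alternating" and not necessarily the exact schedule of the paper. The function names and the toy interval rule are hypothetical.

```python
from typing import Callable, List


def iterations_per_task(task_size: int, batch_size: int) -> int:
    """Number of online iterations needed to see one task once."""
    return -(-task_size // batch_size)  # ceiling division


def intra_process_iterations(
    total_iters: int, interval_at: Callable[[int], int]
) -> List[int]:
    """Return the iterations at which an intra-process update fires.

    Assumption (not from the paper): an update fires as soon as the number
    of iterations since the last update reaches the current interval.
    """
    fired: List[int] = []
    since_last = 0
    for it in range(total_iters):
        since_last += 1
        if since_last >= interval_at(it):
            fired.append(it)
            since_last = 0
    return fired


per_task = iterations_per_task(task_size=2500, batch_size=10)
print("iterations per task:", per_task)


def toy_interval(it: int) -> int:
    """Toy rule: a large interval early in each task, a small one later."""
    return 10 if it % per_task < 50 else 2


fired = intra_process_iterations(2 * per_task, toy_interval)
early = [it for it in fired if it % per_task < 50]
print("updates in the first 50 iterations of each task:", len(early))
print("updates in the remaining iterations:", len(fired) - len(early))
```

With this toy rule, updates are sparse right after a task switch and dense afterwards, mirroring the trend described for Fig. 9.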

Conclusion
In this paper, we aim to relieve the stability-plasticity dilemma in continual learning under the constraint that the data arrive in an online stream. We propose a new online continual learning approach, Trust-Region Adaptive Frequency (TRAF), which alternates between standard-process and intra-process updates at an adaptive frequency. Moreover, TRAF retains useful knowledge through dark knowledge distillation and stores representative data based on confidence scores. Extensive experimental results validate the effectiveness of the proposed method on several benchmarks. As a limitation, the proposed method is tailored to the online setting and may not excel in other continual learning settings, e.g., the offline setting. In the future, we would like to explore more realistic and challenging scenarios and provide further theoretical analysis. Moreover, studying other continual learning methods, e.g., regularization-based methods, in the online setting is also an interesting direction.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.