1 Introduction

Relation extraction (RE) [1], as a fundamental task in Natural Language Processing, has been widely used in many downstream tasks such as knowledge graphs (KGs) [2], question answering (QA) systems [3], etc. Its primary objective is to detect relations between two or more entities in a text. For instance, given the sentence "Beijing is the capital of China," the model needs to identify the relation "capital of" between the entity pair [Beijing, China].

Traditional RE methods typically assume that the relations to be predicted belong to a fixed set of predefined relations. They train a model once on a fixed dataset, without considering the persistence and iteration required in real-world applications. To make models applicable to such scenarios, scholars have introduced continual learning (CL) [4,5,6] into relation extraction and proposed continual relation extraction (CRE). Compared with traditional RE, CRE aims to help models learn new relations while maintaining accurate classification of old relations. However, neural networks suffer from the catastrophic forgetting (CF) [7] problem: parameters learned from a new task overwrite those learned from old tasks, causing performance on the old tasks to drop sharply. Achieving CRE therefore requires addressing the catastrophic forgetting problem. Recent work has focused on solving this problem with memory-based approaches.

Memory-based approaches usually store some typical training samples as memory samples for old relations and replay them during subsequent learning of new relations to avoid forgetting. However, during memory replay, models often suffer from overfitting due to the relatively small number of memory samples. In 2020, Han et al. [8] introduced an episodic memory activation and reconsolidation method (EMAR) to CRE. After memory replay and activation, it reconsolidates all known relations using the memory sample set. This approach effectively mitigates overfitting and strengthens the model's long-term memory of old relations. In 2021, Wu et al. [9] combined curriculum learning and meta-learning, proposing Curriculum-Meta Learning (CML). It selectively reduces the replay frequency of memory samples to prevent overfitting and guides the model to learn the deviation between the current task and the most similar previous task, reducing sequence sensitivity.

Regarding memory-based methods, the selection and utilization of memory samples are crucial for mitigating catastrophic forgetting. In 2021, Cui et al. [10] introduced prototype networks into CRE to make better use of the information in memory samples. They improved sample embeddings with relation prototypes and proposed prototype refinement to effectively utilize the information stored in memory, reducing the model's dependence on the number of memory samples and enhancing its performance. In 2022, Zhao et al. [11] introduced contrastive learning into CRE and proposed a consistent representation learning method (CRL), which uses supervised contrastive learning and knowledge distillation to constrain the embeddings of old tasks from changing significantly. Hu et al. [12] found experimentally that catastrophic forgetting makes the data distributions of the old and new tasks indistinguishable, and designed a CRE framework consisting of a classification network and a contrast network (CRECL). In the contrast network, a given example is compared with each candidate relation prototype to make full use of the negative correlation information and improve the consistency of the data distribution, thus mitigating catastrophic forgetting.

Most methods attribute catastrophic forgetting to the damage caused to representations learned for old relations when new relations appear. This implicitly assumes that models have already learned old relations sufficiently. However, in 2022, Wang et al. [13] empirically found that this assumption may not hold. They proposed an adversarial class augmentation mechanism (ACA) to enhance advanced CRE models such as EMAR; the mechanism helps the model learn more precise and robust representations. Subsequently, Zhao et al. [14] introduced integrated training and focal knowledge distillation to improve the performance of the model on similar relations, and designed a memory-insensitive relation prototype and a memory augmentation strategy to overcome the memory-sample overfitting problem.

The aforementioned methods alleviate catastrophic forgetting by addressing issues such as overfitting of memory samples and improving memory sample utilization. However, they overlook the inherent problems in semantic embeddings generated by BERT encoders, such as anisotropy and uneven distribution. Additionally, they do not sufficiently consider the impact of memory samples on model performance. Therefore, in our work, we build upon the EMAR model with the ACA mechanism (EMAR-ACA) and propose a continual relation extraction method called SS-CRE (Supervised SimCSE-BERT and Static Relation Prototypes for Continual Relation Extraction). Our main contributions are as follows:

  • (1) We employed supervised SimCSE-BERT [15] as the foundational structure of the encoder. Its contrastive learning framework pulls semantically similar embeddings closer together and pushes semantically dissimilar embeddings farther apart, addressing the anisotropy and uneven distribution of semantic embeddings.

  • (2) We introduced the static relation prototypes in our method. By incorporating the static relation prototypes and adjusting their ratio in comparison to dynamic relation prototypes, we adapted the relation prototypes to the feature space. This adjustment reduced the reliance of relation prototypes on memory samples and mitigated the problem of the model being overly sensitive to memory samples.

  • (3) We conducted comparative and ablation experiments on the widely used FewRel and TACRED datasets, demonstrating that our approach effectively improved semantic embeddings and relation prototypes, thereby enhancing the ability of the model to mitigate catastrophic forgetting.

2 Related Work

2.1 Relation Extraction

Traditional relation extraction (RE) methods can be classified into supervised and distantly supervised methods according to the dependence of the training process on labelled data [16, 17].

Supervised RE methods aim to acquire distributed feature representations of data by combining low-level features into more abstract high-level features, thereby addressing two major issues in classical approaches: manual feature selection and the accumulation of feature extraction errors. Depending on whether the subtasks of entity recognition and relation classification are treated separately or jointly, supervised RE methods can be further divided into two categories: pipeline learning and joint learning [18]. Fu et al. [19], for instance, enhanced the prediction accuracy of overlapping relations in joint relation extraction models by incorporating graph convolutional networks (GCNs). Wang et al. [20] focused on strengthening the interaction between entities and relations by using a unified classifier to predict entity and relation labels. While supervised RE methods effectively tackle the aforementioned issues in classical methods and enhance the performance of relation extraction models, they typically rely on labelled data, whereas real-world scenarios often involve a majority of unlabelled data.

To fully utilize the information from unlabelled data and reduce manual annotation costs, researchers have introduced distant supervision into the RE task. Distant supervised RE methods automatically annotate unlabelled data by learning from labelled data, thereby expanding the knowledge base. Although machine-generated labelled data offers speed and cost advantages, achieving the same level of accuracy as human annotation is challenging. Inspired by generative adversarial networks, Qin et al. [21] used an optimized generator to filter distant supervision training datasets, redistributing false-positive samples to obtain a cleaner dataset. To combine the high precision of human annotation with the cost-efficiency of distant supervision, Jung et al. [22] designed a dual-supervision framework. This framework employs two independent networks, HA-Net and DS-Net, to predict labels for human-annotated and distantly supervised data, respectively, effectively leveraging the strengths of both approaches while mitigating the accuracy decline caused by errors in distant supervision.

2.2 Continual Learning

Continual Learning (CL), also referred to as Lifelong Learning (LL), aims to enable models to continuously acquire knowledge from new tasks while preserving their performance on previous tasks. Currently, the primary challenge in continual learning is the issue of catastrophic forgetting, and existing continual learning methods can be broadly classified into three categories.

Regularization-based methods mitigate catastrophic forgetting by adding regularization terms or extra losses to the loss function to control the changes in important parameters related to previous tasks. Typical methods include elastic weight consolidation (EWC) [23] and learning without forgetting (LwF) [24]. EWC regularizes crucial parameters to selectively slow down their learning, effectively retaining knowledge from previous tasks. LwF introduces a distillation loss based on the outputs of the old model and fine-tunes the model on new tasks.

Parameter-isolation based methods allocate separate parameter spaces for each task, preventing mutual interference between parameters associated with new and old tasks to mitigate catastrophic forgetting. Prominent methods in this category include hard attention (HAT) [25] and progressive neural networks (PNN) [26]. HAT employs a hard attention mechanism to adaptively allocate the model's parameter space to different tasks, facilitating both parameter sharing and isolation among tasks. PNN assigns fixed-capacity subnetworks for training on new tasks.

Memory-based methods address catastrophic forgetting by storing samples or relevant information from previous tasks and replaying them during the learning of new tasks, preventing the model from forgetting previous tasks. Representative methods in this category include gradient episodic memory (GEM) [27] and embedding alignment for episodic memory replay (EA-EMR) [28]. GEM constructs inequality constraints using stored memory samples to ensure that losses on previous tasks can only decrease and not increase during training on new tasks. EA-EMR introduces explicit alignment in episodic memory replay models to reduce sentence embedding distortions during model training, thus alleviating forgetting of previous tasks.

Among these three categories of methods, memory-based approaches have been demonstrated by Wang et al., Sun et al., and D’Autume et al. [28,29,30] to hold significant promise in natural language processing tasks. Therefore, we adopt a memory-based approach to implement continual relation extraction.

3 Methodology

3.1 Task Formalization and Overall Framework

3.1.1 Task Formalization

Continual relation extraction (CRE) requires completing a series of relation extraction tasks sequentially. Suppose there is a sequence of RE tasks \(\left\{ {T_1 ,\;T_2 ,\; \ldots ,\;T_K } \right\}\); the k-th task Tk contains the relation set Rk, the training set Dk and the test set Qk. Each sample in \(\left\{ {\left( {x_i ,\;y_i } \right)} \right\}_{i = 1}^{\left| {D_k \cup Q_k } \right|}\) consists of the example xi and its corresponding relation label \(y_i \in R_k\), where |·| denotes the number of elements in a set. After learning task Tk, the CRE model should not only be able to recognize the new relations Rk learned from the current task Tk, but also identify all known relations \(\tilde{R}_k = \cup_{l = 1}^k R_l\) accumulated over the k tasks learned so far.

In addition, we employ an episodic memory module to store memory samples for all known relations \(\tilde{R}_k\). For example, the memory sample set for relation \(r_j \in \tilde{R}_k\) in the episodic memory module is denoted as \(M^{r_j } = \left\{ {\left( {x_i^{r_j } ,\;y_i^{r_j } } \right)} \right\}_{i = 1}^O\), where \(j \in \left[ {1,\;\left| {\tilde{R}_k } \right|} \right]\), O represents the number of samples stored for relation rj. After learning the k-th task Tk, the memory sample set for all known relations \(\tilde{R}_k\) becomes \(\tilde{M}_k = \cup_{i = 1}^k M_i = \cup_{r_j \in \tilde{R}_k } M^{r_j }\), where Mi is the memory sample set of the i-th task Ti.
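To make this formalization concrete, the sketch below shows one possible way to represent the task stream and the episodic memory module in Python. It is a minimal illustration under the definitions above; the class and field names are our own illustrative choices, not identifiers from a released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Sample = Tuple[str, str]   # (example x_i, relation label y_i)

@dataclass
class Task:
    """One RE task T_k with its relation set R_k, training set D_k and test set Q_k."""
    relations: List[str]   # R_k
    train: List[Sample]    # D_k
    test: List[Sample]     # Q_k

@dataclass
class EpisodicMemory:
    """Stores O memory samples per known relation r_j (the sets M^{r_j})."""
    samples_per_relation: int = 10                                  # O
    store: Dict[str, List[Sample]] = field(default_factory=dict)

    def write(self, relation: str, samples: List[Sample]) -> None:
        self.store[relation] = samples[: self.samples_per_relation]

    def all_samples(self) -> List[Sample]:
        """The union M~_k of the memory sets of all known relations."""
        return [s for mem in self.store.values() for s in mem]
```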

3.1.2 Overall Framework

Next, the training process of task Tk will be taken as an example to briefly illustrate the general framework of our proposed method SS-CRE. As shown in Fig. 1, it primarily consists of the following three steps:

Fig. 1 The overall framework of SS-CRE (taking the training process of task Tk as an example). The blue dashed line indicates the data stream, and the black solid and dashed lines indicate forward and backward propagation, respectively

Learning New Task: We learn new relations Rk on the union set \(D_k^{{\text{train}}}\) of the training set Dk and its augmented sample set \(D_k^{{\text{aug}}}\), and fine-tune the supervised SimCSE-BERT encoder simultaneously;

Selecting Memory Samples: We select memory samples for the relation set Rk from the training set Dk and store them in the episodic memory module;

Memory Replay, Activation, and Reconsolidation: We iteratively perform memory replay, activation, and reconsolidation. First, we learn new relations and remember old relations on the activation set Ak. Then, we consolidate all known relations \(\tilde{R}_k\) using the memory sample set \(\tilde{M}_k\) and the relation prototype set Pk obtained from the dynamic relation prototype set \(P_k^{{\rm{dyn}}}\) and static relation prototype set \(\tilde{P}_k^{{\rm{static}}}\).
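For concreteness, the loop below outlines how these three steps could be organized in code. It is a high-level sketch under the assumptions of this paper; the helper functions (adversarial_class_augment, learn_new_relations, select_typical_samples, replay_and_activate, mix_prototypes, reconsolidate) are placeholders for the procedures detailed in Sects. 3.3–3.5, not actual APIs.

```python
def train_task(model, memory, task, beta=0.2, epoch1=2, epoch2=2):
    """Illustrative outline of one SS-CRE task step; all helpers are placeholders."""
    # Step 1: learn the new relations on D_k plus its ACA-augmented samples,
    # fine-tuning the supervised SimCSE-BERT encoder with the loss of Eq. (2).
    d_train = task.train + adversarial_class_augment(task.train)
    for _ in range(epoch1):
        learn_new_relations(model, d_train)

    # Step 2: select O typical samples per new relation (K-Means, Sect. 3.4)
    # and store them in the episodic memory module.
    for relation in task.relations:
        memory.write(relation, select_typical_samples(model, task.train, relation))

    # Step 3: iterate memory replay/activation on A_k = D_k ∪ M~_k (Eq. 6), then
    # reconsolidate all known relations with the mixed prototypes (Eqs. 5 and 7).
    activation_set = task.train + memory.all_samples()
    for _ in range(epoch2):
        replay_and_activate(model, activation_set)           # iter1 = 1
        prototypes = mix_prototypes(model, memory, beta)     # (1-β)·static + β·dynamic
        reconsolidate(model, memory, prototypes)             # iter2 = 1
```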

3.2 The Supervised SimCSE-BERT Encoder

Most existing work utilizes BERT encoders to encode the semantic features of the examples, but reference [15] points out that semantic embeddings obtained in this way suffer from anisotropy and uneven distribution. Anisotropy means that the embeddings tend to cluster in a few specific directions. In continual relation extraction, the model needs to accurately distinguish between different entities and their relations; if the embeddings exhibit a high degree of anisotropy, the embeddings of different entities and relations may overlap excessively in the feature space, making them difficult for the model to distinguish. Uneven distribution means that the embeddings are too dense in some regions of the feature space and relatively sparse in others. Since continual relation extraction requires the model to extract rich and diverse relations from the examples, an uneven distribution may cause the model to overfit the high-frequency relations in the dataset and ignore low-frequency but equally important relations, which in turn degrades performance.

To overcome these problems, we use supervised SimCSE-BERT instead of BERT. Supervised SimCSE is a contrastive learning model: its contrastive loss function minimizes the distance between positive pairs (i.e., pairs of semantically similar sentences) while maximizing the distance between negative pairs (i.e., pairs of semantically dissimilar sentences), pushing each embedding as far away as possible from the others and yielding a more uniform distribution. Through this contrastive training of BERT, the semantic embeddings generated by supervised SimCSE-BERT are more evenly distributed, which helps the model learn a more accurate and discriminative feature space.

The structure of the supervised SimCSE-BERT based encoder is shown in Fig. 2. To obtain the semantic embedding of example xi, first, four special tokens [E11], [E12], [E21] and [E22] are added at the beginning and end positions of the two entities to highlight their presence in the sentence; then, the example xi with these tokens is input into the supervised SimCSE-BERT to generate the hidden representations \({{\varvec{e}}}_{11} ,\;{{\varvec{e}}}_{21} \in {\text{R}}^{d^h }\) of [E11] and [E21] (\(d^h\) is the dimension of the SimCSE-BERT hidden representation); finally, the two are concatenated to obtain ei, and the semantic embedding hi of example xi is obtained through a fully connected layer and a normalization layer as follows,

$$ {{\varvec{h}}}_i = LN\left( {{{\varvec{W}}}[{{\varvec{e}}}_{11} ;{{\varvec{e}}}_{21} ] + {{\varvec{b}}}} \right) $$
(1)

where \({{\varvec{W}}} \in {\text{R}}^{d^s \times 2d^h }\) (\(d^s\) is the semantic embedding dimension) and \({{\varvec{b}}} \in {\text{R}}^{d^s }\) are the trainable parameters, ";" denotes the vector concatenation operation, and LN(·) represents the normalization layer operation.

Fig. 2 Architecture of the supervised SimCSE-BERT encoder
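As a concrete illustration of Fig. 2 and Eq. (1), the following PyTorch sketch builds such an encoder on top of a publicly released supervised SimCSE checkpoint. The checkpoint name, the way marker positions are located, and the output dimension default are our assumptions for illustration, not details fixed by the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SimCSEBertEncoder(nn.Module):
    """Sketch of the Eq. (1) encoder; checkpoint name and sem_dim are illustrative."""
    def __init__(self, model_name="princeton-nlp/sup-simcse-bert-base-uncased", sem_dim=768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Entity marker tokens [E11], [E12], [E21], [E22] wrap the two entities.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]})
        self.bert = AutoModel.from_pretrained(model_name)
        self.bert.resize_token_embeddings(len(self.tokenizer))
        hidden = self.bert.config.hidden_size              # d^h
        self.linear = nn.Linear(2 * hidden, sem_dim)        # W, b
        self.layer_norm = nn.LayerNorm(sem_dim)              # LN(·)

    def forward(self, sentences):
        enc = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden_states = self.bert(**enc).last_hidden_state   # (B, L, d^h)
        e11_id = self.tokenizer.convert_tokens_to_ids("[E11]")
        e21_id = self.tokenizer.convert_tokens_to_ids("[E21]")
        # Positions of the [E11] and [E21] markers in each sequence.
        idx11 = (enc["input_ids"] == e11_id).float().argmax(dim=1)
        idx21 = (enc["input_ids"] == e21_id).float().argmax(dim=1)
        batch = torch.arange(hidden_states.size(0))
        e11 = hidden_states[batch, idx11]                     # e_11
        e21 = hidden_states[batch, idx21]                     # e_21
        # h_i = LN(W[e_11; e_21] + b), Eq. (1)
        return self.layer_norm(self.linear(torch.cat([e11, e21], dim=-1)))
```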

3.3 Learning for New Tasks

When task Tk arrives, the supervised SimCSE-BERT encoder has not previously seen any examples containing the new relations and is therefore unable to extract their semantic features. A common practice is to fine-tune the encoder on the training set Dk to learn the relations in Rk. However, reference [13] points out that the relation representations learned in this way are not robust.

To address this issue, we build upon the EMAR-ACA model, which introduces the adversarial class augmentation mechanism ACA before learning a new task. Thus, the model can obtain more accurate and robust representations of the relations after training. The learning process of the new task Tk is shown in Fig. 1, and the loss function used is

$$ \mathcal L ({{\varvec{\theta}}}) = - \sum_{i = 1}^{\left| {D_k^{{\rm{train}}} } \right|} {\sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\delta_{y_i = r_j } \times \log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_j } \right)} \right)}}{{\sum_{l = 1}^{|\tilde{R}_k |} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_l } \right)} \right)} }}} } $$
(2)

where \(D_k^{{\rm{train}}}\) is the union of the training set Dk and its augmented sample set \(D_k^{{\text{aug}}}\), hi represents the semantic embedding of the example xi in \(D_k^{{\rm{train}}}\), rj is the relation embedding of the j-th relation rj in the known relation set \(\tilde{R}_k\), g(·) is the cosine similarity function between the semantic embedding and the relation embedding, and θ represents the trainable parameters, including the relation embeddings and the parameters of the supervised SimCSE-BERT encoder. The indicator \(\delta_{y_i = r_j }\) equals 1 if yi = rj and 0 otherwise. For each new relation, its embedding is first randomly initialized and then optimized using Eq. (2).
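A minimal implementation of this loss could look as follows; the same softmax-over-cosine-similarity form is reused for memory replay and activation in Eq. (6). The function name and the batch-mean reduction (Eq. (2) writes a sum) are illustrative choices.

```python
import torch.nn.functional as F

def softmax_cosine_loss(h, rel_embeddings, labels):
    """Eq. (2): cross-entropy over cosine similarities g(h_i, r_j).

    h: (B, d_s) semantic embeddings of a batch from D_k^train.
    rel_embeddings: (|R~_k|, d_s) embeddings of all known relations.
    labels: (B,) indices of the gold relations.
    Note: F.cross_entropy averages over the batch, whereas Eq. (2) writes a sum;
    the two differ only by a constant scale.
    """
    logits = F.cosine_similarity(h.unsqueeze(1), rel_embeddings.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits, labels)
```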

3.4 Selecting Typical Samples for Memory

After learning the task Tk, some typical samples from the training set Dk for the relation set Rk are selected as memory samples and stored in the episodic memory module, as shown in Fig. 1. Referring to previous work, firstly, the semantic embeddings of all the training samples of the relation \(r_j \in R_k\) are obtained by the supervised SimCSE-BERT encoder, where \(j \in \left[ {1,\;\left| {R_k } \right|} \right]\); secondly, the K-Means algorithm is applied to these embeddings with the number of clusters denoted as O, which corresponds to the number of samples stored for the relation rj; finally, for each cluster, the sample closest to the centroid is selected as a memory sample for that relation and stored in the episodic memory module.
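A sketch of this selection step using scikit-learn's K-Means is shown below; the function signature is illustrative and assumes the embeddings and samples of one relation are index-aligned.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_memory_samples(embeddings: np.ndarray, samples: list, num_memory: int = 10):
    """Select one memory sample per K-Means cluster for a single relation r_j.

    embeddings: (N, d_s) encoder outputs for all training samples of the relation,
    index-aligned with `samples`; num_memory is the memory size O.
    """
    kmeans = KMeans(n_clusters=num_memory, n_init=10).fit(embeddings)
    selected = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        selected.append(samples[int(np.argmin(distances))])  # closest to the centroid
    return selected
```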

3.5 Replay, Activation, and Reconsolidation

After learning a new task as well as selecting memory samples, memory replay, activation, and reconsolidation are used to enhance the model's ability to recognize and distinguish between old and new relations, as shown in Fig. 1.

3.5.1 Static Relation Prototypes

A relation prototype is a refined representation of a relation. Previous works usually compute relation prototypes from memory samples alone, making the prototypes overly sensitive to those samples. To address this issue, we introduce static relation prototypes, defined as the mean semantic embeddings of the training samples. Because this mean is computed once after the task is learned and is not recomputed as the feature space changes during subsequent learning, it is referred to as the "static relation prototype" in reference [14]. In contrast, the semantic embeddings of memory samples evolve with further learning, so their mean is termed the "dynamic relation prototype". To obtain the relation prototype of \(r_j \in \tilde{R}_k\), we first compute and store the static relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{static}}}\) of rj after learning the new task, using the following formula.

$$ {{\varvec{p}}}_{r_j }^{{\rm{static}}} = \frac{1}{{\left| {D_k^{r_j } } \right|}}\sum_{i = 1}^{\left| {D_k^{r_j } } \right|} {{{\varvec{h}}}_i } $$
(3)

where \(D_k^{r_j }\) represents the training samples for relation rj in the training set Dk and hi corresponds to the semantic embedding of example xi in \(D_k^{r_j }\). Secondly, we use the memory sample set \(M^{r_j }\) of relation rj to calculate the dynamic relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\) as follows,

$$ {{\varvec{p}}}_{r_j }^{{\rm{dyn}}} = \frac{1}{{\left| {M^{r_j } } \right|}}\sum_{i = 1}^{\left| {M^{r_j } } \right|} {{{\varvec{h}}}_i } $$
(4)

where hi is the semantic embedding of the example \(x_i^{r_j }\) in \(M^{r_j }\). Finally, the static relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{static}}}\) is fine-tuned by the dynamic relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\) to obtain the relation prototype \({{\varvec{p}}}_{r_j }\) adapted to the current feature space, as follows:

$$ {{\varvec{p}}}_{r_j } = \left( {1 - \beta } \right) \cdot {{\varvec{p}}}_{r_j }^{{\rm{static}}} + \beta \cdot {{\varvec{p}}}_{r_j }^{{\rm{dyn}}} $$
(5)

where \(\beta \in [0,1]\) is a hyperparameter that controls the ratio of dynamic to static relation prototypes. Therefore, after learning task Tk, we have the static relation prototype set \(\tilde{P}_k^{{\rm{static}}} = \cup_{i = 1}^k P_i^{{\rm{static}}} = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }^{{\rm{static}}}\) for all known relations, the dynamic relation prototype set \(P_k^{{\rm{dyn}}} = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\), and the relation prototype set \(P_k = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }\), where \(P_i^{{\rm{static}}}\) is the static relation prototype set for the i-th task Ti stored in memory.
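The prototype computation of Eqs. (3)–(5) can be sketched as follows; the function names are illustrative.

```python
import torch

@torch.no_grad()
def static_prototype(train_embeddings: torch.Tensor) -> torch.Tensor:
    """Eq. (3): mean embedding of all training samples of r_j in D_k^{r_j},
    computed once after learning T_k and kept fixed afterwards."""
    return train_embeddings.mean(dim=0)

@torch.no_grad()
def mixed_prototype(static_proto: torch.Tensor,
                    memory_embeddings: torch.Tensor,
                    beta: float = 0.2) -> torch.Tensor:
    """Eqs. (4)-(5): the dynamic prototype is the mean of the re-encoded memory
    samples in M^{r_j}; it is interpolated with the stored static prototype."""
    dynamic_proto = memory_embeddings.mean(dim=0)                # Eq. (4)
    return (1.0 - beta) * static_proto + beta * dynamic_proto    # Eq. (5)
```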


3.5.2 Memory Replay and Activation

The training set \(D_k\) and the memory sample set \(\tilde{M}_k\) are integrated into an activation set \(A_k\), in which the model continuously learns the old and new relations. The loss function used for this process is defined as follows:

$$ {\mathcal L^{A}}\left( {{\varvec{\theta}}} \right) = - \sum_{i = 1}^{\left| {A_k } \right|} {\sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\delta_{y_i = r_j } \times \log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_j } \right)} \right)}}{{\sum_{l = 1}^{\left| {\tilde{R}_k } \right|} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_l } \right)} \right)} }}} } $$
(6)

where hi is the semantic embedding of the example xi in Ak.

3.5.3 Memory Reconsolidation

If only memory replay and activation are conducted, the model may overfit due to the imbalance in the number of samples between new and old relations. Therefore, each time memory replay and activation are performed to grasp the old and new relations, memory reconsolidation is employed to strengthen this process, similar to the consolidation exercises the human brain uses to maintain the stability of long-term memory. The loss function used for memory reconsolidation is as follows:

$$ {\mathcal L^{R}}\left( {{\varvec{\theta}}} \right) = - \sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\sum_{i = 1}^{\left| {M^{r_j } } \right|} {\log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{p}}}_{r_j } } \right)} \right)}}{{\sum_{l = 1}^{\left| {\tilde{R}_k } \right|} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{p}}}_{r_l } } \right)} \right)} }}} } $$
(7)

where hi represents the semantic embedding of the example \(x_i^{r_j }\) in the memory sample set \(M^{r_j }\) of the relation \(r_j \in \tilde{R}_k\), and \({{\varvec{p}}}_{r_j } \in P_k\) is the prototype of the relation rj, computed by Eq. (5).
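A sketch of Eq. (7) is given below: it reuses the softmax-over-cosine form of Eq. (2), but scores the memory-sample embeddings against the mixed prototypes of Eq. (5) rather than the relation embeddings. The names are illustrative.

```python
import torch.nn.functional as F

def reconsolidation_loss(h_mem, prototypes, labels):
    """Eq. (7): cross-entropy over cosine similarities g(h_i, p_{r_j}).

    h_mem: (N, d_s) embeddings of all memory samples in M~_k.
    prototypes: (|R~_k|, d_s) stacked prototypes from the set P_k (Eq. (5)).
    labels: (N,) relation indices of the memory samples.
    """
    logits = F.cosine_similarity(h_mem.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits, labels)
```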

3.6 Prediction

After training on task Tk, the memory sample set \(M^{r_j }\) for relation \(r_j \in \tilde{R}_k\) is obtained from the episodic memory module, and the final relation prototype \(\tilde{{\varvec{p}}}_{r_j }\) is computed for prediction:

$$ \tilde{{\varvec{p}}}_{r_j } = \frac{{{{\varvec{r}}}_j + \sum_{i = 1}^{\left| {M^{r_j } } \right|} {{{\varvec{h}}}_i } }}{{1 + \left| {M^{r_j } } \right|}} $$
(8)

where \({{\varvec{r}}}_j\) is the relation embedding of the relation \(r_j \in \tilde{R}_k\) and hi is the semantic embedding of the example \(x_i^{r_j }\) in \(M^{r_j }\). For each example xi in the union set \(\tilde{Q}_k = \cup_{i = 1}^k Q_i\) of the first k task test sets, the score between example xi and relation \(r_j\) is defined as follows

$$ s_{x_i ,\;r_j } = g\left( {{{\varvec{h}}}_i ,\;\tilde{{\varvec{p}}}_{r_j } } \right) $$
(9)

where hi is the semantic embedding of the example xi in the test set \(\tilde{Q}_k\). Finally, the predicted relation yi for xi is computed as follows:

$$ y_i = {\mathop {\arg \max }\limits_{r_j \in \tilde{R}_k }} s_{x_i ,\;r_j } $$
(10)
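The prediction procedure of Eqs. (8)–(10) can be sketched as follows; the tensor layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(h_test, rel_embeddings, memory_embeddings):
    """Eqs. (8)-(10): build final prototypes and pick the best-scoring relation.

    h_test: (B, d_s) embeddings of test examples from Q~_k.
    rel_embeddings: (|R~_k|, d_s) relation embeddings r_j.
    memory_embeddings: list of (O_j, d_s) tensors, the encoded memory set M^{r_j}
    of each known relation, in the same order as rel_embeddings.
    """
    protos = []
    for r_j, mem_h in zip(rel_embeddings, memory_embeddings):
        protos.append((r_j + mem_h.sum(dim=0)) / (1 + mem_h.size(0)))       # Eq. (8)
    protos = torch.stack(protos)                                             # (|R~_k|, d_s)
    scores = F.cosine_similarity(h_test.unsqueeze(1), protos.unsqueeze(0), dim=-1)  # Eq. (9)
    return scores.argmax(dim=-1)                                             # Eq. (10)
```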

4 Experiments

4.1 Datasets

The experiments were conducted on two widely adopted datasets, with a training/test/validation split ratio of 3:1:1.

FewRel [31] was originally a supervised dataset designed for few-shot relation classification but has gradually been adopted for continual relation extraction. The data is sourced from Wikipedia and contains 100 relations with 700 samples each. To be consistent with previous works, 80 relations were used in the experiments.

TACRED [32] is a large-scale supervised relation extraction dataset constructed from TAC KBP competition data, containing 42 relations (including "no_relation") and 106,264 samples. To maintain consistency with previous works, "no_relation" was removed, and each relation was constrained to have 320 training samples and 40 test samples.

Since the TACRED dataset is characterized by unbalanced relations and large semantic differences, its task difficulty is much greater than that of FewRel. In addition, to mitigate the impact of different task orders on the experimental results, we set up 5 different task sequences for each dataset. To ensure a fair comparison of model performance, these task sequences are exactly the same as those used in previous works.

4.2 Compared Models

To evaluate the performance of SS-CRE, we compare it with five recent CRE models.

EMAR [8] introduces a memory consolidation module after memory replay and activation to enhance the long-term memory of old relations (for a fair comparison, the encoder in EMAR is replaced with BERT).

RP-CRE [10] initializes the memory network using relation prototypes of memory samples, refining subsequent sample embeddings.

CRL [11] maintains the stability of the relation embeddings during memory replay by employing contrastive learning and knowledge distillation.

EMAR-ACA [13] builds upon EMAR by adding an adversarial class augmentation mechanism (ACA) to enhance the robustness of relation representations.

CEAR [14] designs memory-insensitive relation prototypes and memory augmentation to mitigate the overfitting problem, and proposes integrated training and focal knowledge distillation to better distinguish similar relations.

4.3 Experimental Setting

The experiments were run on an RTX 3090 GPU (24 GB) and an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz (43 GB), and the model was implemented with Python 3.8, PyTorch 1.8.1 and CUDA 11.1.

The experiments set up 5 different task sequences for FewRel and TACRED. Each task sequence was created by randomly dividing all relations into 10 groups to simulate 10 tasks. In the FewRel dataset, each task consisted of 8 relations, while in the TACRED dataset, each task comprised 4 relations. The samples were encoded using supervised SimCSE-BERT with a word embedding dimension of 768 and a learning rate of 1e−5. The network parameters were optimized using the AdamW optimizer with a weight decay of 1e−2. The batch size was 32, the vocabulary size was 30,522, the number of memory samples stored for each relation was 10 (except in Sect. 4.7), and the hyperparameter \(\beta\) was set to 0.2. The number of training epochs for learning a new task (epoch1) was 2, the number of epochs for memory replay, activation, and reconsolidation (epoch2) was 2, and the numbers of iterations for memory replay and activation (iter1) and for memory reconsolidation (iter2) were both 1.
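For reference, the hyperparameters reported above can be collected into a single configuration object; the dictionary keys are illustrative names, not identifiers from a released codebase.

```python
# Hyperparameters as reported in Sect. 4.3; key names are illustrative.
CONFIG = {
    "encoder": "supervised SimCSE-BERT",
    "embedding_dim": 768,
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
    "batch_size": 32,
    "vocab_size": 30522,
    "memory_size": 10,   # O, memory samples per relation (varied in Sect. 4.7)
    "beta": 0.2,         # β in Eq. (5)
    "epoch1": 2,         # epochs for learning a new task
    "epoch2": 2,         # epochs for replay, activation and reconsolidation
    "iter1": 1,          # iterations of memory replay and activation
    "iter2": 1,          # iterations of memory reconsolidation
}
```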

4.4 Evaluation Metrics

Following [10], the average accuracy on all learned tasks is used as the evaluation metric. For each task sequence, the accuracy ACCk of the model after learning the current task Tk is calculated as follows,

$$ ACC_k = acc_{k,\;\tilde{Q}_k } $$
(11)

where \(acc_{k,\;\tilde{Q}_k }\) is the classification accuracy on the union set \(\tilde{Q}_k = \cup_{i = 1}^k Q_i\) of the first k test sets for all known relations.
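Equivalently, ACCk is a plain accuracy over the pooled test sets; a minimal sketch, with illustrative names:

```python
def acc_on_union(predictions, labels):
    """acc_{k, Q~_k}: classification accuracy over the pooled test sets
    Q~_k = Q_1 ∪ ... ∪ Q_k; predictions and labels are index-aligned lists."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)
```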

4.5 Overall Performance Comparison

Tables 1 and 2 record the average accuracy (%) of the different CRE models on all learned tasks in the FewRel and TACRED datasets, respectively. Based on these results, the following conclusions can be drawn:

  • (1) Observing Tables 1 and 2, we find that the proposed method SS-CRE achieves the highest average accuracy on several tasks, especially the later ones. It also shows clear advantages over EMAR-ACA, indicating that the improvements to semantic embeddings and relation prototypes effectively enhance model performance and make the model more stable during continual learning.

  • (2) Comparing Tables 1 and 2, all models perform significantly worse on the TACRED dataset. From the dataset perspective, the primary reason is likely the characteristics of TACRED, namely class imbalance and substantial semantic differences, which make continual relation extraction more challenging. Nevertheless, compared with EMAR-ACA, the proposed method still improves performance on TACRED (by 1.61%), a larger gain than on the class-balanced FewRel dataset (0.34%). This suggests that SS-CRE is more robust in class-imbalanced scenarios.

  • (3) Observing Table 2, although CRL outperforms SS-CRE on T1 and T2, as the number of tasks increases SS-CRE becomes fully trained and its performance degrades less than that of CRL. This indicates that SS-CRE handles the catastrophic forgetting problem better than CRL.

  • (4) Compared with CEAR, SS-CRE also shows advantages. Although the average accuracy of CEAR is higher than that of SS-CRE on the first few tasks, as the number of tasks increases SS-CRE eventually surpasses CEAR and achieves higher accuracy, indicating that SS-CRE is superior to CEAR in long-term memory.

Table 1 Comparative experimental results of the models on the FewRel dataset
Table 2 Experimental results for each model on the TACRED dataset

4.6 Ablation Study

To validate the effectiveness of the static relation prototype pstatic and the supervised SimCSE-BERT, ablation experiments were conducted. Table 3 reports the average accuracy (%) of each model on the last five tasks on the TACRED dataset. In Table 3, " + pstatic" indicates the addition of the static relation prototype to EMAR-ACA, while " + sup. SimCSE-BERT" denotes the replacement of the BERT encoder in EMAR-ACA with the supervised SimCSE-BERT encoder. Based on the experimental results in Table 3, the following conclusions can be drawn:

  • (1) Comparing EMAR-ACA with " + pstatic", there is a slight improvement in model performance (0.54%). This suggests that incorporating information from the training samples when calculating relation prototypes can effectively reduce the reliance on memory samples, addressing the issue of the model being overly sensitive to memory samples. However, the performance improvement is not significant, which may be due to the class imbalance in the TACRED dataset. If there were originally few samples for a specific relation in the training set, the enhancement from adding the static relation prototype pstatic might not be pronounced.

  • (2) Comparing EMAR-ACA and " + sup. SimCSE-BERT", a significant performance improvement (1.53%) is achieved by replacing the BERT encoder with supervised SimCSE-BERT. The reason behind this improvement may be that the semantic embeddings generated by the BERT encoder suffer from anisotropy and uneven distribution, so the cosine similarity function is not a good measure of the similarity between semantic embeddings and relation embeddings. Moreover, the class imbalance of the TACRED dataset exacerbates the uneven distribution of semantic embeddings. The supervised SimCSE-BERT encoder adopted by SS-CRE uses a contrastive learning framework, effectively addressing these limitations of the BERT encoder and enabling the model to obtain semantic embeddings with richer and more accurate information.

Table 3 Results of ablation experiments for each model on the TACRED dataset

4.7 Effect of Memory Size

Memory size refers to the number of memory samples that need to be stored for each relation in CRE. In this section, we explore the effect of memory size on the performance of SS-CRE. Taking EMAR-ACA as a reference and keeping all configurations and task sequences consistent, we compared three memory sizes: 5, 10, and 20. The experimental results are shown in Fig. 3, from which we can draw the following conclusions:

  • (1) Observing the overall trend in the line chart, it is evident that as the memory size decreases, the model's performance gradually declines, indicating that memory size is an important factor affecting model performance. However, SS-CRE maintains a higher average accuracy than EMAR-ACA across all memory sizes, suggesting that SS-CRE reduces the model's sensitivity to memory samples.

  • (2) Observing Fig. 3, it is evident that, for the same memory size, SS-CRE's performance is more stable than that of EMAR-ACA. This indicates that SS-CRE enhances the ability to mitigate catastrophic forgetting.

Fig. 3 The effect of varying memory sizes on the models across the two datasets. a Results on FewRel. b Results on TACRED

5 Conclusion

This paper introduces a novel CRE method that enhances model performance by improving semantic embeddings and relation prototypes. Specifically, in the sample encoding phase, the use of supervised SimCSE-BERT in place of BERT addresses the issues of anisotropy and uneven distribution in semantic embeddings, resulting in more accurate sample information. In the memory consolidation phase, the static relation prototype is introduced to reduce the sensitivity of the model to memory samples by incorporating information from training samples. The performance of the model is improved on both widely used datasets, validating the effectiveness of the method in this paper. In future work, we will combine few-shot learning with continual relation extraction to make the model more applicable in practical scenarios.