1 Introduction

Relation extraction (RE) [1], as a fundamental task in Natural Language Processing, has been widely used in many downstream tasks such as knowledge graphs (KGs) [2], question answering (QA) systems [3], etc. Its primary objective is to detect relations between two or more entities in a text. For instance, given the sentence "Beijing is the capital of China," the model needs to identify the relation "capital of" between the entity pair [Beijing, China].

Traditional RE methods typically assume that the relations to be predicted belong to a fixed set of predefined relations. They train a model once on a fixed dataset, without considering the persistence and iteration required in real-world applications. To make models applicable to such scenarios, scholars have introduced continual learning (CL) [4,5,6] into relation extraction and proposed continual relation extraction (CRE). Compared with traditional RE, CRE aims to help models learn new relations while maintaining accurate classification of old relations. However, neural networks suffer from the catastrophic forgetting (CF) [7] problem: parameters learned from a new task overwrite those learned from old tasks, causing performance on the old tasks to drop sharply. Achieving CRE therefore requires addressing the catastrophic forgetting problem. Recent work has focused on solving this problem with memory-based approaches.

Memory-based approaches usually store some typical training samples as memory samples for old relations and replay them during subsequent learning of new relations to avoid forgetting. However, during memory replay, models often suffer from overfitting due to the relatively small number of memory samples. In 2020, Han et al. [8] introduced an episodic memory activation and reconsolidation method (EMAR) to CRE. After memory replay and activation, it reconsolidates all known relations using the memory sample set. This approach effectively mitigates overfitting and strengthens the model's long-term memory of old relations. In 2021, Wu et al. [9] combined curriculum learning and meta-learning, proposing Curriculum-Meta Learning (CML). It selectively reduces the replay frequency of memory samples to prevent overfitting and guides the model to learn the deviation between the current task and the most similar previous task, reducing sequence sensitivity.

Regarding memory-based methods, the selection and utilization of memory samples are crucial for mitigating catastrophic forgetting. In 2021, Cui et al. [10] introduced prototype networks into CRE to make better use of the information in memory samples. They improved sample embeddings with relation prototypes and proposed prototype refinement to effectively utilize the information stored in memory, reducing the model's dependence on the number of memory samples and enhancing its performance. In 2022, Zhao et al. [11] introduced contrastive learning into CRE and proposed a consistent representation learning method (CRL), which uses supervised contrastive learning and knowledge distillation to constrain the embeddings of old tasks from changing significantly. Hu et al. [12] found experimentally that catastrophic forgetting makes the data distributions of the old and new tasks indistinguishable, and designed a CRE framework consisting of a classification network and a contrast network (CRECL). In the contrast network, a given example is compared with each candidate relation prototype to make full use of the negative correlation information and improve the consistency of the data distribution, thus mitigating catastrophic forgetting.

Most methods attribute catastrophic forgetting to the damage caused to representations learned for old relations when new relations appear. This implicitly assumes that models have already learned old relations sufficiently. However, in 2022, Wang et al. [13] empirically found that this assumption may not hold. They proposed an adversarial class augmentation mechanism (ACA) to enhance advanced CRE models such as EMAR; the mechanism helps the model learn more precise and robust representations. Subsequently, Zhao et al. [14] introduced integrated training and focal knowledge distillation to improve the performance of the model on similar relations, and designed a memory-insensitive relation prototype and a memory augmentation strategy to overcome the memory-sample overfitting problem.

The aforementioned methods alleviate catastrophic forgetting by addressing issues such as overfitting of memory samples and improving memory sample utilization. However, they overlook the inherent problems in semantic embeddings generated by BERT encoders, such as anisotropy and uneven distribution. Additionally, they do not sufficiently consider the impact of memory samples on model performance. Therefore, in our work, we build upon the EMAR model with the ACA mechanism (EMAR-ACA) and propose a continual relation extraction method called SS-CRE (Supervised SimCSE-BERT and Static Relation Prototypes for Continual Relation Extraction). Our main contributions are as follows:

  • (1) We employed supervised SimCSE-BERT [15] as the foundational structure of the encoder. Its contrastive learning framework pulls semantically similar embeddings closer together and pushes semantically dissimilar embeddings farther apart, addressing the anisotropy and uneven distribution of semantic embeddings.

  • (2) We introduced the static relation prototypes in our method. By incorporating the static relation prototypes and adjusting their ratio in comparison to dynamic relation prototypes, we adapted the relation prototypes to the feature space. This adjustment reduced the reliance of relation prototypes on memory samples and mitigated the problem of the model being overly sensitive to memory samples.

  • (3) We conducted comparative and ablation experiments on the widely used FewRel and TACRED datasets, demonstrating that our approach effectively improved semantic embeddings and relation prototypes, thereby enhancing the ability of the model to mitigate catastrophic forgetting.

2 Related Work

2.1 Relation Extraction

Traditional relation extraction (RE) methods can be classified into supervised and distantly supervised methods according to the dependence of the training process on labelled data [16, 17].

Supervised RE methods aim to acquire distributed feature representations of data by combining low-level features into more abstract high-level features, thereby addressing two major issues in classical approaches: manual feature selection and the accumulation of feature extraction errors. Depending on whether the subtasks of entity recognition and relation classification are treated separately or jointly, supervised RE methods can be further divided into two categories: pipeline learning and joint learning [18]. Fu et al. [19], for instance, enhanced the prediction accuracy of overlapping relations in joint relation extraction models by incorporating graph convolutional networks (GCNs). Wang et al. [20] focused on strengthening the interaction between entities and relations by using a unified classifier to predict entity and relation labels. While supervised RE methods effectively tackle the aforementioned issues in classical methods and enhance the performance of relation extraction models, they typically rely on labelled data, whereas real-world scenarios often involve a majority of unlabelled data.

To fully utilize the information from unlabelled data and reduce manual annotation costs, researchers have introduced distant supervision into the RE task. Distant supervised RE methods automatically annotate unlabelled data by learning from labelled data, thereby expanding the knowledge base. Although machine-generated labelled data offers speed and cost advantages, achieving the same level of accuracy as human annotation is challenging. Inspired by generative adversarial networks, Qin et al. [21] used an optimized generator to filter distant supervision training datasets, redistributing false-positive samples to obtain a cleaner dataset. To combine the high precision of human annotation with the cost-efficiency of distant supervision, Jung et al. [22] designed a dual-supervision framework. This framework employs two independent networks, HA-Net and DS-Net, to predict labels for human-annotated and distantly supervised data, respectively, effectively leveraging the strengths of both approaches while mitigating the accuracy decline caused by errors in distant supervision.

2.2 Continual Learning

Continual Learning (CL), also referred to as Lifelong Learning (LL), aims to enable models to continuously acquire knowledge from new tasks while preserving their performance on previous tasks. Currently, the primary challenge in continual learning is the issue of catastrophic forgetting, and existing continual learning methods can be broadly classified into three categories.

Regularization-based methods mitigate catastrophic forgetting by adding regularization terms or extra losses to the loss function to control the changes in important parameters related to previous tasks. Typical methods include elastic weight consolidation (EWC) [23] and learning without forgetting (LwF) [24]. EWC regularizes crucial parameters to selectively slow down their learning, effectively retaining knowledge from previous tasks. LwF introduces a distillation loss based on the outputs of the old model and fine-tunes the model on new tasks.

Parameter-isolation based methods allocate separate parameter spaces for each task, preventing mutual interference between parameters associated with new and old tasks to mitigate catastrophic forgetting. Prominent methods in this category include hard attention (HAT) [25] and progressive neural networks (PNN) [26]. HAT employs a hard attention mechanism to adaptively allocate the model's parameter space to different tasks, facilitating both parameter sharing and isolation among tasks. PNN assigns fixed-capacity subnetworks for training on new tasks.

Memory-based methods address catastrophic forgetting by storing samples or relevant information from previous tasks and replaying them during the learning of new tasks, preventing the model from forgetting previous tasks. Representative methods in this category include gradient episodic memory (GEM) [27] and embedding alignment for episodic memory replay (EA-EMR) [28]. GEM constructs inequality constraints using stored memory samples to ensure that losses on previous tasks can only decrease and not increase during training on new tasks. EA-EMR introduces explicit alignment in episodic memory replay models to reduce sentence embedding distortions during model training, thus alleviating forgetting of previous tasks.

Among these three categories of methods, memory-based approaches have been demonstrated by Wang et al., Sun et al., and D’Autume et al. [28,29,30] to hold significant promise in natural language processing tasks. Therefore, we adopt a memory-based approach to implement continual relation extraction.

3 Methodology

3.1 Task Formalization and Overall Framework

3.1.1 Task Formalization

Continual relation extraction (CRE) requires completing a series of relation extraction tasks sequentially. Suppose there is a sequence of RE tasks \(\left\{ {T_1 ,\;T_2 ,\; \ldots ,\;T_K } \right\}\); the k-th task Tk contains the relation set Rk, the training set Dk and the test set Qk. Each sample in \(\left\{ {\left( {x_i ,\;y_i } \right)} \right\}_{i = 1}^{\left| {D_k \cup Q_k } \right|}\) consists of the example xi and its corresponding relation label \(y_i \in R_k\), where |·| denotes the number of elements in a set. After learning task Tk, the CRE model should not only be able to recognize the new relations Rk learned from the current task Tk, but also identify all known relations \(\tilde{R}_k = \cup_{l = 1}^k R_l\) accumulated over the k tasks learned so far.

In addition, we employ an episodic memory module to store memory samples for all known relations \(\tilde{R}_k\). For example, the memory sample set for relation \(r_j \in \tilde{R}_k\) in the episodic memory module is denoted as \(M^{r_j } = \left\{ {\left( {x_i^{r_j } ,\;y_i^{r_j } } \right)} \right\}_{i = 1}^O\), where \(j \in \left[ {1,\;\left| {\tilde{R}_k } \right|} \right]\), O represents the number of samples stored for relation rj. After learning the k-th task Tk, the memory sample set for all known relations \(\tilde{R}_k\) becomes \(\tilde{M}_k = \cup_{i = 1}^k M_i = \cup_{r_j \in \tilde{R}_k } M^{r_j }\), where Mi is the memory sample set of the i-th task Ti.
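To make this formalization concrete, the sketch below shows one possible way to represent the task stream and the episodic memory module in Python. It is a minimal illustration under the definitions above; the class and field names are our own illustrative choices, not identifiers from a released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Sample = Tuple[str, str]   # (example x_i, relation label y_i)

@dataclass
class Task:
    """One RE task T_k with its relation set R_k, training set D_k and test set Q_k."""
    relations: List[str]   # R_k
    train: List[Sample]    # D_k
    test: List[Sample]     # Q_k

@dataclass
class EpisodicMemory:
    """Stores O memory samples per known relation r_j (the sets M^{r_j})."""
    samples_per_relation: int = 10                                  # O
    store: Dict[str, List[Sample]] = field(default_factory=dict)

    def write(self, relation: str, samples: List[Sample]) -> None:
        self.store[relation] = samples[: self.samples_per_relation]

    def all_samples(self) -> List[Sample]:
        """The union M~_k of the memory sets of all known relations."""
        return [s for mem in self.store.values() for s in mem]
```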

3.1.2 Overall Framework

Next, the training process of task Tk will be taken as an example to briefly illustrate the general framework of our proposed method SS-CRE. As shown in Fig. 1, it primarily consists of the following three steps:

Fig. 1 The overall framework of SS-CRE (taking the training process of task Tk as an example). The blue dashed line indicates the data stream, and the black solid and dashed lines indicate forward and backward propagation, respectively

Learning New Task: We learn new relations Rk on the union set \(D_k^{{\text{train}}}\) of the training set Dk and its augmented sample set \(D_k^{{\text{aug}}}\), and fine-tune the supervised SimCSE-BERT encoder simultaneously;

Selecting Memory Samples: We select memory samples for the relation set Rk from the training set Dk and store them in the episodic memory module;

Memory Replay, Activation, and Reconsolidation: We iteratively perform memory replay, activation, and reconsolidation. First, we learn new relations and remember old relations on the activation set Ak. Then, we consolidate all known relations \(\tilde{R}_k\) using the memory sample set \(\tilde{M}_k\) and the relation prototype set Pk obtained from the dynamic relation prototype set \(P_k^{{\rm{dyn}}}\) and static relation prototype set \(\tilde{P}_k^{{\rm{static}}}\).
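For concreteness, the loop below outlines how these three steps could be organized in code. It is a high-level sketch under the assumptions of this paper; the helper functions (adversarial_class_augment, learn_new_relations, select_typical_samples, replay_and_activate, mix_prototypes, reconsolidate) are placeholders for the procedures detailed in Sects. 3.3–3.5, not actual APIs.

```python
def train_task(model, memory, task, beta=0.2, epoch1=2, epoch2=2):
    """Illustrative outline of one SS-CRE task step; all helpers are placeholders."""
    # Step 1: learn the new relations on D_k plus its ACA-augmented samples,
    # fine-tuning the supervised SimCSE-BERT encoder with the loss of Eq. (2).
    d_train = task.train + adversarial_class_augment(task.train)
    for _ in range(epoch1):
        learn_new_relations(model, d_train)

    # Step 2: select O typical samples per new relation (K-Means, Sect. 3.4)
    # and store them in the episodic memory module.
    for relation in task.relations:
        memory.write(relation, select_typical_samples(model, task.train, relation))

    # Step 3: iterate memory replay/activation on A_k = D_k ∪ M~_k (Eq. 6), then
    # reconsolidate all known relations with the mixed prototypes (Eqs. 5 and 7).
    activation_set = task.train + memory.all_samples()
    for _ in range(epoch2):
        replay_and_activate(model, activation_set)           # iter1 = 1
        prototypes = mix_prototypes(model, memory, beta)     # (1-β)·static + β·dynamic
        reconsolidate(model, memory, prototypes)             # iter2 = 1
```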

3.2 The Supervised SimCSE-BERT Encoder

Most existing work utilizes BERT encoders to encode the semantic features of the examples, but reference [15] points out that semantic embeddings obtained in this way suffer from anisotropy and uneven distribution. Anisotropy means that the embeddings tend to cluster in a few specific directions. In continual relation extraction, the model needs to accurately distinguish between different entities and their relations; if the embeddings exhibit a high degree of anisotropy, the embeddings of different entities and relations may overlap excessively in the feature space, making them difficult for the model to distinguish. Uneven distribution means that the embeddings are too dense in some regions of the feature space and relatively sparse in others. Since continual relation extraction requires the model to extract rich and diverse relations from the examples, an uneven distribution may cause the model to overfit the high-frequency relations in the dataset and ignore low-frequency but equally important relations, which in turn degrades performance.

To overcome these problems, we use supervised SimCSE-BERT instead of BERT. Supervised SimCSE is a contrastive learning model: its contrastive loss function minimizes the distance between positive pairs (i.e., pairs of semantically similar sentences) while maximizing the distance between negative pairs (i.e., pairs of semantically dissimilar sentences), pushing each embedding as far away as possible from the others and yielding a more uniform distribution. Through this contrastive training of BERT, the semantic embeddings generated by supervised SimCSE-BERT are more evenly distributed, which helps the model learn a more accurate and discriminative feature space.

The structure of the supervised SimCSE-BERT based encoder is shown in Fig. 2. To obtain the semantic embedding of example xi, first, four special tokens [E11], [E12], [E21] and [E22] are added at the beginning and end positions of the two entities to highlight their presence in the sentence; then, the example xi with these tokens is input into the supervised SimCSE-BERT to generate the hidden representations \({{\varvec{e}}}_{11} ,\;{{\varvec{e}}}_{21} \in {\text{R}}^{d^h }\) of [E11] and [E21] (\(d^h\) is the dimension of the SimCSE-BERT hidden representation); finally, the two are concatenated to obtain ei, and the semantic embedding hi of example xi is obtained through a fully connected layer and a normalization layer as follows,

$$ {{\varvec{h}}}_i = LN\left( {{{\varvec{W}}}[{{\varvec{e}}}_{11} ;{{\varvec{e}}}_{21} ] + {{\varvec{b}}}} \right) $$
(1)

where \({{\varvec{W}}} \in {\text{R}}^{d^s \times 2d^h }\) (\(d^s\) is the semantic embedding dimension) and \({{\varvec{b}}} \in {\text{R}}^{d^s }\) are the trainable parameters, ";" denotes the vector concatenation operation, and LN(·) represents the normalization layer operation.

Fig. 2 Architecture of the supervised SimCSE-BERT encoder
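As a concrete illustration of Fig. 2 and Eq. (1), the following PyTorch sketch builds such an encoder on top of a publicly released supervised SimCSE checkpoint. The checkpoint name, the way marker positions are located, and the output dimension default are our assumptions for illustration, not details fixed by the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SimCSEBertEncoder(nn.Module):
    """Sketch of the Eq. (1) encoder; checkpoint name and sem_dim are illustrative."""
    def __init__(self, model_name="princeton-nlp/sup-simcse-bert-base-uncased", sem_dim=768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Entity marker tokens [E11], [E12], [E21], [E22] wrap the two entities.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]})
        self.bert = AutoModel.from_pretrained(model_name)
        self.bert.resize_token_embeddings(len(self.tokenizer))
        hidden = self.bert.config.hidden_size              # d^h
        self.linear = nn.Linear(2 * hidden, sem_dim)        # W, b
        self.layer_norm = nn.LayerNorm(sem_dim)              # LN(·)

    def forward(self, sentences):
        enc = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden_states = self.bert(**enc).last_hidden_state   # (B, L, d^h)
        e11_id = self.tokenizer.convert_tokens_to_ids("[E11]")
        e21_id = self.tokenizer.convert_tokens_to_ids("[E21]")
        # Positions of the [E11] and [E21] markers in each sequence.
        idx11 = (enc["input_ids"] == e11_id).float().argmax(dim=1)
        idx21 = (enc["input_ids"] == e21_id).float().argmax(dim=1)
        batch = torch.arange(hidden_states.size(0))
        e11 = hidden_states[batch, idx11]                     # e_11
        e21 = hidden_states[batch, idx21]                     # e_21
        # h_i = LN(W[e_11; e_21] + b), Eq. (1)
        return self.layer_norm(self.linear(torch.cat([e11, e21], dim=-1)))
```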

3.3 Learning for New Tasks

When task Tk arrives, the supervised SimCSE-BERT encoder has not previously seen any examples containing the new relations and is therefore unable to extract their semantic features. A common practice is to fine-tune the encoder on the training set Dk to learn the relations in Rk. However, reference [13] points out that the relation representations learned in this way are not robust.

To address this issue, we build upon the EMAR-ACA model, which introduces the adversarial class augmentation mechanism ACA before learning a new task. Thus, the model can obtain more accurate and robust representations of the relations after training. The learning process of the new task Tk is shown in Fig. 1, and the loss function used is

$$ \mathcal L ({{\varvec{\theta}}}) = - \sum_{i = 1}^{\left| {D_k^{{\rm{train}}} } \right|} {\sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\delta_{y_i = r_j } \times \log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_j } \right)} \right)}}{{\sum_{l = 1}^{|\tilde{R}_k |} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_l } \right)} \right)} }}} } $$
(2)

where \(D_k^{{\rm{train}}}\) is the union of the training set Dk and its augmented sample set \(D_k^{{\text{aug}}}\), hi represents the semantic embedding of the example xi in \(D_k^{{\rm{train}}}\), rj is the relation embedding of the j-th relation rj in the known relation set \(\tilde{R}_k\), g(·) is the cosine similarity function between the semantic embedding and the relation embedding, and θ represents the trainable parameters, including the relation embeddings and the parameters of the supervised SimCSE-BERT encoder. The indicator \(\delta_{y_i = r_j }\) equals 1 if yi = rj and 0 otherwise. For each new relation, its embedding is first randomly initialized and then optimized using Eq. (2).
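A minimal implementation of this loss could look as follows; the same softmax-over-cosine-similarity form is reused for memory replay and activation in Eq. (6). The function name and the batch-mean reduction (Eq. (2) writes a sum) are illustrative choices.

```python
import torch.nn.functional as F

def softmax_cosine_loss(h, rel_embeddings, labels):
    """Eq. (2): cross-entropy over cosine similarities g(h_i, r_j).

    h: (B, d_s) semantic embeddings of a batch from D_k^train.
    rel_embeddings: (|R~_k|, d_s) embeddings of all known relations.
    labels: (B,) indices of the gold relations.
    Note: F.cross_entropy averages over the batch, whereas Eq. (2) writes a sum;
    the two differ only by a constant scale.
    """
    logits = F.cosine_similarity(h.unsqueeze(1), rel_embeddings.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits, labels)
```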

3.4 Selecting Typical Samples for Memory

After learning the task Tk, some typical samples from the training set Dk for the relation set Rk are selected as memory samples and stored in the episodic memory module, as shown in Fig. 1. Referring to previous work, firstly, the semantic embeddings of all the training samples of the relation \(r_j \in R_k\) are obtained by the supervised SimCSE-BERT encoder, where \(j \in \left[ {1,\;\left| {R_k } \right|} \right]\); secondly, the K-Means algorithm is applied to these embeddings with the number of clusters denoted as O, which corresponds to the number of samples stored for the relation rj; finally, for each cluster, the sample closest to the centroid is selected as a memory sample for that relation and stored in the episodic memory module.
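A sketch of this selection step using scikit-learn's K-Means is shown below; the function signature is illustrative and assumes the embeddings and samples of one relation are index-aligned.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_memory_samples(embeddings: np.ndarray, samples: list, num_memory: int = 10):
    """Select one memory sample per K-Means cluster for a single relation r_j.

    embeddings: (N, d_s) encoder outputs for all training samples of the relation,
    index-aligned with `samples`; num_memory is the memory size O.
    """
    kmeans = KMeans(n_clusters=num_memory, n_init=10).fit(embeddings)
    selected = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        selected.append(samples[int(np.argmin(distances))])  # closest to the centroid
    return selected
```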

3.5 Replay, Activation, and Reconsolidation

After learning a new task as well as selecting memory samples, memory replay, activation, and reconsolidation are used to enhance the model's ability to recognize and distinguish between old and new relations, as shown in Fig. 1.

3.5.1 Static Relation Prototypes

A relation prototype is a refined representation of a relation. Previous works usually compute relation prototypes from memory samples alone, making the prototypes overly sensitive to those samples. To address this issue, we introduce static relation prototypes, defined as the mean semantic embeddings of the training samples. Because this mean is computed once after the task is learned and is not recomputed as the feature space changes during subsequent learning, it is referred to as the "static relation prototype" in reference [14]. In contrast, the semantic embeddings of memory samples evolve with further learning, so their mean is termed the "dynamic relation prototype". To obtain the relation prototype of \(r_j \in \tilde{R}_k\), we first compute and store the static relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{static}}}\) of rj after learning the new task, using the following formula.

$$ {{\varvec{p}}}_{r_j }^{{\rm{static}}} = \frac{1}{{\left| {D_k^{r_j } } \right|}}\sum_{i = 1}^{\left| {D_k^{r_j } } \right|} {{{\varvec{h}}}_i } $$
(3)

where \(D_k^{r_j }\) represents the training samples for relation rj in the training set Dk and hi corresponds to the semantic embedding of example xi in \(D_k^{r_j }\). Secondly, we use the memory sample set \(M^{r_j }\) of relation rj to calculate the dynamic relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\) as follows,

$$ {{\varvec{p}}}_{r_j }^{{\rm{dyn}}} = \frac{1}{{\left| {M^{r_j } } \right|}}\sum_{i = 1}^{\left| {M^{r_j } } \right|} {{{\varvec{h}}}_i } $$
(4)

where hi is the semantic embedding of the example \(x_i^{r_j }\) in \(M^{r_j }\). Finally, the static relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{static}}}\) is fine-tuned by the dynamic relation prototype \({{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\) to obtain the relation prototype \({{\varvec{p}}}_{r_j }\) adapted to the current feature space, as follows:

$$ {{\varvec{p}}}_{r_j } = \left( {1 - \beta } \right) \cdot {{\varvec{p}}}_{r_j }^{{\rm{static}}} + \beta \cdot {{\varvec{p}}}_{r_j }^{{\rm{dyn}}} $$
(5)

where \(\beta \in [0,1]\) is a hyperparameter that controls the ratio of dynamic to static relation prototypes. Therefore, after learning task Tk, we have the static relation prototype set \(\tilde{P}_k^{{\rm{static}}} = \cup_{i = 1}^k P_i^{{\rm{static}}} = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }^{{\rm{static}}}\) for all known relations, the dynamic relation prototype set \(P_k^{{\rm{dyn}}} = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }^{{\rm{dyn}}}\), and the relation prototype set \(P_k = \cup_{r_j \in \tilde{R}_k } {{\varvec{p}}}_{r_j }\), where \(P_i^{{\rm{static}}}\) is the static relation prototype set for the i-th task Ti stored in memory.
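The prototype computation of Eqs. (3)–(5) can be sketched as follows; the function names are illustrative.

```python
import torch

@torch.no_grad()
def static_prototype(train_embeddings: torch.Tensor) -> torch.Tensor:
    """Eq. (3): mean embedding of all training samples of r_j in D_k^{r_j},
    computed once after learning T_k and kept fixed afterwards."""
    return train_embeddings.mean(dim=0)

@torch.no_grad()
def mixed_prototype(static_proto: torch.Tensor,
                    memory_embeddings: torch.Tensor,
                    beta: float = 0.2) -> torch.Tensor:
    """Eqs. (4)-(5): the dynamic prototype is the mean of the re-encoded memory
    samples in M^{r_j}; it is interpolated with the stored static prototype."""
    dynamic_proto = memory_embeddings.mean(dim=0)                # Eq. (4)
    return (1.0 - beta) * static_proto + beta * dynamic_proto    # Eq. (5)
```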


3.5.2 Memory Replay and Activation

The training set \(D_k\) and the memory sample set \(\tilde{M}_k\) are integrated into an activation set \(A_k\), in which the model continuously learns the old and new relations. The loss function used for this process is defined as follows:

$$ {\mathcal L^{A}}\left( {{\varvec{\theta}}} \right) = - \sum_{i = 1}^{\left| {A_k } \right|} {\sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\delta_{y_i = r_j } \times \log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_j } \right)} \right)}}{{\sum_{l = 1}^{\left| {\tilde{R}_k } \right|} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{r}}}_l } \right)} \right)} }}} } $$
(6)

where hi is the semantic embedding of the example xi in Ak.

3.5.3 Memory Reconsolidation

If only memory replay and activation are conducted, the model may overfit due to the imbalance in the number of samples between new and old relations. Therefore, each time memory replay and activation are performed to grasp the old and new relations, memory reconsolidation is employed to strengthen this process, similar to the consolidation exercises the human brain uses to maintain the stability of long-term memory. The loss function used for memory reconsolidation is as follows:

$$ {\mathcal L^{R}}\left( {{\varvec{\theta}}} \right) = - \sum_{j = 1}^{\left| {\tilde{R}_k } \right|} {\sum_{i = 1}^{\left| {M^{r_j } } \right|} {\log \frac{{\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{p}}}_{r_j } } \right)} \right)}}{{\sum_{l = 1}^{\left| {\tilde{R}_k } \right|} {\exp \left( {g\left( {{{\varvec{h}}}_i ,\;{{\varvec{p}}}_{r_l } } \right)} \right)} }}} } $$
(7)

where hi represents the semantic embedding of the example \(x_i^{r_j }\) in the memory sample set \(M^{r_j }\) of the relation \(r_j \in \tilde{R}_k\), and \({{\varvec{p}}}_{r_j } \in P_k\) is the prototype of the relation rj, computed by Eq. (5).
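A sketch of Eq. (7) is given below: it reuses the softmax-over-cosine form of Eq. (2), but scores the memory-sample embeddings against the mixed prototypes of Eq. (5) rather than the relation embeddings. The names are illustrative.

```python
import torch.nn.functional as F

def reconsolidation_loss(h_mem, prototypes, labels):
    """Eq. (7): cross-entropy over cosine similarities g(h_i, p_{r_j}).

    h_mem: (N, d_s) embeddings of all memory samples in M~_k.
    prototypes: (|R~_k|, d_s) stacked prototypes from the set P_k (Eq. (5)).
    labels: (N,) relation indices of the memory samples.
    """
    logits = F.cosine_similarity(h_mem.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits, labels)
```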

3.6 Prediction

After training on task Tk, the memory sample set \(M^{r_j }\) for relation \(r_j \in \tilde{R}_k\) is obtained from the episodic memory module, and the final relation prototype \(\tilde{{\varvec{p}}}_{r_j }\) is computed for prediction:

$$ \tilde{{\varvec{p}}}_{r_j } = \frac{{{{\varvec{r}}}_j + \sum_{i = 1}^{\left| {M^{r_j } } \right|} {{{\varvec{h}}}_i } }}{{1 + \left| {M^{r_j } } \right|}} $$
(8)

where \({{\varvec{r}}}_j\) is the relation embedding of the relation \(r_j \in \tilde{R}_k\) and hi is the semantic embedding of the example \(x_i^{r_j }\) in \(M^{r_j }\). For each example xi in the union set \(\tilde{Q}_k = \cup_{i = 1}^k Q_i\) of the first k task test sets, the score between example xi and relation \(r_j\) is defined as follows

$$ s_{x_i ,\;r_j } = g\left( {{{\varvec{h}}}_i ,\;\tilde{{\varvec{p}}}_{r_j } } \right) $$
(9)

where hi is the semantic embedding of the example xi in the test set \(\tilde{Q}_k\). Finally, the predicted relation yi for xi is computed as follows:

$$ y_i = {\mathop {\arg \max }\limits_{r_j \in \tilde{R}_k }} s_{x_i ,\;r_j } $$
(10)
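The prediction procedure of Eqs. (8)–(10) can be sketched as follows; the tensor layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(h_test, rel_embeddings, memory_embeddings):
    """Eqs. (8)-(10): build final prototypes and pick the best-scoring relation.

    h_test: (B, d_s) embeddings of test examples from Q~_k.
    rel_embeddings: (|R~_k|, d_s) relation embeddings r_j.
    memory_embeddings: list of (O_j, d_s) tensors, the encoded memory set M^{r_j}
    of each known relation, in the same order as rel_embeddings.
    """
    protos = []
    for r_j, mem_h in zip(rel_embeddings, memory_embeddings):
        protos.append((r_j + mem_h.sum(dim=0)) / (1 + mem_h.size(0)))       # Eq. (8)
    protos = torch.stack(protos)                                             # (|R~_k|, d_s)
    scores = F.cosine_similarity(h_test.unsqueeze(1), protos.unsqueeze(0), dim=-1)  # Eq. (9)
    return scores.argmax(dim=-1)                                             # Eq. (10)
```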

4 Experiments

4.1 Datasets

The experiments were conducted on two widely adopted datasets, with a training/test/validation split ratio of 3:1:1.

FewRel [31] was originally a supervised dataset designed for few-shot relation classification but has gradually been adopted for continual relation extraction. The data is sourced from Wikipedia and contains 100 relations with 700 samples each. To be consistent with previous works, 80 relations were used in the experiments.

TACRED [32] is a large-scale supervised relation extraction dataset constructed from TAC KBP competition data, containing 42 relations (including "no_relation") and 106,264 samples. To maintain consistency with previous works, "no_relation" was removed, and each relation was constrained to have 320 training samples and 40 test samples.

Since the TACRED dataset is characterized by unbalanced relations and large semantic differences, its task difficulty is much greater than that of FewRel. In addition, to mitigate the impact of different task orders on the experimental results, we set up 5 different task sequences for each dataset. To ensure a fair comparison of model performance, these task sequences are exactly the same as those used in previous works.

4.2 Compared Models

To evaluate the performance of SS-CRE, we compare it with five recent CRE models.

EMAR [8] introduces a memory consolidation module after memory replay and activation to enhance the long-term memory of old relations (for a fair comparison, the encoder in EMAR is replaced with BERT).

RP-CRE [10] initializes the memory network using relation prototypes of memory samples, refining subsequent sample embeddings.

CRL [11] maintains the stability of the relation embeddings during memory replay by employing contrastive learning and knowledge distillation.

EMAR-ACA [13] builds upon EMAR by adding an adversarial class augmentation mechanism (ACA) to enhance the robustness of relation representations.

CEAR [14] designs memory-insensitive relation prototypes and memory augmentation to mitigate the overfitting problem, and proposes integrated training and focal knowledge distillation to better distinguish similar relations.

4.3 Experimental Setting

The experiments were run on an RTX 3090 GPU (24 GB) and an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz (43 GB), and the model was implemented with Python 3.8, PyTorch 1.8.1 and CUDA 11.1.

The experiments set up 5 different task sequences for FewRel and TACRED. Each task sequence was created by randomly dividing all relations into 10 groups to simulate 10 tasks. In the FewRel dataset, each task consisted of 8 relations, while in the TACRED dataset, each task comprised 4 relations. The samples were encoded using supervised SimCSE-BERT with a word embedding dimension of 768 and a learning rate of 1e−5. The network parameters were optimized using the AdamW optimizer with a weight decay of 1e−2. The batch size was 32, the vocabulary size was 30,522, the number of memory samples stored for each relation was 10 (except in Sect. 4.7), and the hyperparameter \(\beta\) was set to 0.2. The number of training epochs for learning a new task (epoch1) was 2, the number of epochs for memory replay, activation, and reconsolidation (epoch2) was 2, and the numbers of iterations for memory replay and activation (iter1) and for memory reconsolidation (iter2) were both 1.
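For reference, the hyperparameters reported above can be collected into a single configuration object; the dictionary keys are illustrative names, not identifiers from a released codebase.

```python
# Hyperparameters as reported in Sect. 4.3; key names are illustrative.
CONFIG = {
    "encoder": "supervised SimCSE-BERT",
    "embedding_dim": 768,
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
    "batch_size": 32,
    "vocab_size": 30522,
    "memory_size": 10,   # O, memory samples per relation (varied in Sect. 4.7)
    "beta": 0.2,         # β in Eq. (5)
    "epoch1": 2,         # epochs for learning a new task
    "epoch2": 2,         # epochs for replay, activation and reconsolidation
    "iter1": 1,          # iterations of memory replay and activation
    "iter2": 1,          # iterations of memory reconsolidation
}
```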

4.4 Evaluation Metrics

Following [10], the average accuracy on all learned tasks is used as the evaluation metric. For each task sequence, the accuracy ACCk of the model after learning the current task Tk is calculated as follows,

$$ ACC_k = acc_{k,\;\tilde{Q}_k } $$
(11)

where \(acc_{k,\;\tilde{Q}_k }\) is the classification accuracy on the union set \(\tilde{Q}_k = \cup_{i = 1}^k Q_i\) of the first k test sets for all known relations.
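Equivalently, ACCk is a plain accuracy over the pooled test sets; a minimal sketch, with illustrative names:

```python
def acc_on_union(predictions, labels):
    """acc_{k, Q~_k}: classification accuracy over the pooled test sets
    Q~_k = Q_1 ∪ ... ∪ Q_k; predictions and labels are index-aligned lists."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)
```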

4.5 Overall Performance Comparison

Tables 1 and 2 record the average accuracy (%) of the different CRE models on all learned tasks in the FewRel and TACRED datasets, respectively. Based on these results, the following conclusions can be drawn:

  • (1) Observing Tables 1 and 2, we find that the proposed method SS-CRE achieves the highest average accuracy on several tasks, especially the later ones. It also shows clear advantages over EMAR-ACA, indicating that the improvements to semantic embeddings and relation prototypes effectively enhance model performance and make the model more stable during continual learning.

  • (2) Comparing Tables 1 and 2, all models perform significantly worse on the TACRED dataset. From the dataset perspective, the primary reason is likely the characteristics of TACRED, namely class imbalance and substantial semantic differences, which make continual relation extraction more challenging. Nevertheless, compared with EMAR-ACA, the proposed method still improves performance on TACRED (by 1.61%), a larger gain than on the class-balanced FewRel dataset (0.34%). This suggests that SS-CRE is more robust in class-imbalanced scenarios.

  • (3) Observing Table 2, although CRL outperforms SS-CRE on T1 and T2, as the number of tasks increases SS-CRE becomes fully trained and its performance degrades less than that of CRL. This indicates that SS-CRE handles the catastrophic forgetting problem better than CRL.

  • (4) Compared with CEAR, SS-CRE also shows advantages. Although the average accuracy of CEAR is higher than that of SS-CRE on the first few tasks, as the number of tasks increases SS-CRE eventually surpasses CEAR and achieves higher accuracy, indicating that SS-CRE is superior to CEAR in long-term memory.

Table 1 Comparative experimental results of the models on the FewRel dataset
Table 2 Experimental results for each model on the TACRED dataset

4.6 Ablation Study

To validate the effectiveness of the static relation prototype pstatic and the supervised SimCSE-BERT, ablation experiments were conducted. Table 3 reports the average accuracy (%) of each model on the last five tasks on the TACRED dataset. In Table 3, " + pstatic" indicates the addition of the static relation prototype to EMAR-ACA, while " + sup. SimCSE-BERT" denotes the replacement of the BERT encoder in EMAR-ACA with the supervised SimCSE-BERT encoder. Based on the experimental results in Table 3, the following conclusions can be drawn:

  • (1) Comparing EMAR-ACA with " + pstatic", there is a slight improvement in model performance (0.54%). This suggests that incorporating information from the training samples when calculating relation prototypes can effectively reduce the reliance on memory samples, addressing the issue of the model being overly sensitive to memory samples. However, the performance improvement is not significant, which may be due to the class imbalance in the TACRED dataset. If there were originally few samples for a specific relation in the training set, the enhancement from adding the static relation prototype pstatic might not be pronounced.

  • (2) Comparing EMAR-ACA and " + sup. SimCSE-BERT", a significant performance improvement (1.53%) is achieved by replacing the BERT encoder with supervised SimCSE-BERT. The reason behind this improvement may be that the semantic embeddings generated by the BERT encoder suffer from anisotropy and uneven distribution, so the cosine similarity function is not a good measure of the similarity between semantic embeddings and relation embeddings. Moreover, the class imbalance of the TACRED dataset exacerbates the uneven distribution of semantic embeddings. The supervised SimCSE-BERT encoder adopted by SS-CRE uses a contrastive learning framework, effectively addressing these limitations of the BERT encoder and enabling the model to obtain semantic embeddings with richer and more accurate information.

Table 3 Results of ablation experiments for each model on the TACRED dataset

4.7 Effect of Memory Size

Memory size refers to the number of memory samples that need to be stored for each relation in CRE. In this section, we explore the effect of memory size on the performance of SS-CRE. Taking EMAR-ACA as a reference and keeping all configurations and task sequences consistent, we compared three memory sizes: 5, 10, and 20. The experimental results are shown in Fig. 3, from which we can draw the following conclusions:

  • (1) Observing the overall trend in the line chart, it is evident that as the memory size decreases, the model's performance gradually declines, indicating that memory size is an important factor affecting model performance. However, SS-CRE maintains a higher average accuracy than EMAR-ACA across all memory sizes, suggesting that SS-CRE reduces the model's sensitivity to memory samples.

  • (2) Observing Fig. 3, it is evident that, for the same memory size, SS-CRE's performance is more stable than that of EMAR-ACA. This indicates that SS-CRE enhances the ability to mitigate catastrophic forgetting.

Fig. 3 The effect of varying memory sizes on the models across the two datasets. a Results on FewRel. b Results on TACRED

5 Conclusion

This paper introduces a novel CRE method that enhances model performance by improving semantic embeddings and relation prototypes. Specifically, in the sample encoding phase, the use of supervised SimCSE-BERT in place of BERT addresses the issues of anisotropy and uneven distribution in semantic embeddings, resulting in more accurate sample information. In the memory consolidation phase, the static relation prototype is introduced to reduce the sensitivity of the model to memory samples by incorporating information from training samples. The performance of the model is improved on both widely used datasets, validating the effectiveness of the method in this paper. In future work, we will combine few-shot learning with continual relation extraction to make the model more applicable in practical scenarios.