Introduction

Relation extraction (RE) aims to extract explicit and implicit entity–relations from unstructured text and transform them into a structured knowledge base. It is an essential component of natural language processing (NLP) tasks, such as question answering (QA) [1], text summarization [2], and knowledge graph construction [3].

Fig. 1

Examples of entity–relations in real-world scenarios. Similar to dependency parsing trees, we establish interconnections between entities using dependency relations to represent their associations. Relations among the entities are illustrated above the context with directed, labeled arcs from heads to dependents. ROOT indicates a special node that explicitly marks the head of all entity–relations

Current approaches to relation extraction can be roughly divided into two categories: (1) pipelined approaches that decompose the task into named entity recognition and relation classification performed separately [4, 5]; and (2) joint approaches that perform the two subtasks simultaneously in a joint learning manner [6,7,8]. However, as noted by Zeng et al. [9], these models cannot handle the overlapping triple problem. As shown in Fig. 1, multiple relational triples share the overlapping entity “Frances E. Allen”. To address the above limitations, subsequent works introduced various strategies, such as tagging-based methods [10,11,12], span-based methods [13, 14], and table-filling-based methods [15, 16]. Although these approaches have achieved promising success, they follow an extractive paradigm, which makes it difficult to deal with three key challenges.

First, hierarchical dependency entity–relations exist between different tags in real-life scenarios, especially in the scholar profiling task. Traditional entity–relation methods may not adequately capture the complexity and richness of these relations. As shown in Fig. 1, a model may easily obtain the triple (Robert M. Metcalfe, obtain, PhD). However, this triple is not enough to fully express that the PhD was obtained from Harvard University. This inadequacy arises from the inherent dependency relation between the head entity “Harvard University” and the dependent entity “PhD”. Existing studies on this issue usually treat relation extraction as a multi-turn question answering problem [17]. This idea employs extractive machine reading comprehension (MRC), which predicts the start and end positions of the answer span given the context. However, its success heavily relies on annotation tools to label accurate information and requires expensive labeling costs.

Second, implicit entity–relations may be contained in a context. Most methods can only extract entities that appear in the context and cannot infer implicit entity–relations. Take Fig. 1 as an example: Robert M. Metcalfe entered MIT in 1964, where the year 1964 is inferred from the logical relation indicated by “after five years”. Moreover, based on the given context, it is reasonable to infer that Robert M. Metcalfe initiated his master’s degree at Harvard University around 1969 and pursued his Ph.D. there in 1970.

Third, an entity may have different surface forms and needs to be normalized. Take the context in Fig. 1 for example: an extractor easily identifies the entity MIT, but it is difficult to normalize it to the entity Massachusetts Institute of Technology unless the extractor incorporates a reference set. Although common in entity linking, obtaining normalized entities is particularly difficult for entity–relation extraction when entities are not explicitly mentioned in the context.

To address the above limitations, we propose a novel paradigm, GenRE, which performs entity–relation extraction by generating entity–relations from the context through generative multi-turn question answering with contrastive learning. Specifically, to make it easy for GenRE to move from one domain to another, a template-based question prompt generation scheme is first designed to produce questions for the different turns. We then formulate entity–relation extraction as a generative QA task and generate explicit and implicit entity–relations with the general language model (GLM) [18] instead of extractive question answering, where we introduce a special candidate answer “unknown” to address the early-stop problem caused by unanswerable questions in multi-turn QA. In this way, we not only answer questions when possible but also determine when no answer is supported by the given context. Meanwhile, we introduce self-supervised contrastive learning to improve the faithfulness of the generated answers, which guides the model to increase the distance between positive and negative samples. Figure 2 provides an example and an overview of our approach. The key insight of GenRE is to simulate how humans learn knowledge by getting to the bottom of it.

Fig. 2

Examples and overview of our GenRE. Each turn contains a question and an answer

To summarize, our main contributions are as follows:

  • We introduce a novel framework that solves relation extraction tasks by casting them as a generative multi-turn QA problem that takes into account the rich semantic information of hierarchical dependency relations.

  • By incorporating contrastive learning, the model’s ability to discriminate between positive and negative answers is significantly improved.

  • We conduct extensive experiments on three datasets, including two versions of two public datasets and a custom dataset, to verify the effectiveness and flexibility of our approach, especially in entity normalization and inference.

The remainder of this paper is organized as follows. The related work is summarized in “Related works” section. We formalize the RE task and present the proposed GenRE model in detail in “Preliminaries” section. A series of experiments are conducted to evaluate the performance of GenRE in “Experimental setups” and “Results and discussion” sections. Finally, we conclude our work in “Conclusion” section.

Related works

Relation extraction

Traditionally, relation extraction has been tackled in a pipelined manner with separate subtasks for named entity recognition and relation classification [5, 19, 20]. However, pipelined systems suffer significantly from error propagation. To overcome this problem, joint learning methods have been proposed to exploit the interrelation between the two tasks [8, 21, 22].

Various strategies exist for joint learning, such as treating the two tasks as a sequence-labeling problem, as suggested by some researchers [8, 23]. Wei et al. [10] utilized a cascaded binary tagging framework that extracts subject entities and their corresponding relations and object entities in two stages. Despite their initial success, these methods cannot precisely identify overlapping triples in a sentence, because an entity pair may have multiple relations or two relation triples may share an overlapping entity. Ren et al. [12] further improved [10] by designing a bidirectional extraction framework to reduce entity extraction omissions.

A series of studies have formulated joint entity and relation extraction as a table-filling problem. Wang et al. [15] designed a table encoder and a sequence encoder that collaborate to facilitate representation learning. TPLinker [24] is a single-step model for jointly extracting entities and overlapping relations, which proposes a handshaking tagging scheme. UNIRE [25] presented a novel table-filling approach in which entities and relations are represented as squares and rectangles. Ren et al. [16] proposed a joint extraction method that considers global information to improve table modeling. OneRel [26] is similar to TPLinker [24], but reduces the number of matrices and improves model efficiency.

The span-based approach is another widely used method for joint entity and relation extraction. Eberts et al. [27] introduced SpERT, a span-based model for joint entity–relation extraction. Zhong et al. [14] presented a simple approach that learns two independent encoders for entity recognition and relation extraction.

Recently, some efforts have focused on identifying overlapping triples, usually by introducing strategies such as seq2seq models with copy mechanisms [9, 28, 29], graph convolutional networks [22, 30, 31], and reinforcement learning [32, 33]. Unfortunately, these approaches may fail to capture semantic dependencies, as they still predict the relationship between each pair of entities in isolation, ignoring the effects of other entity–relations.

Another interesting thread casts information extraction as an MRC task [17, 34,35,36]. Li et al. [17] adopted a similar idea and framed relation extraction as an MRC problem, which can not only identify overlapping triples but also deal with hierarchical dependencies among multiple triples. Nevertheless, this approach relies on identifying entity spans, making it difficult to deduce implicit relationships between entities.

More recently, researchers have proposed utilizing generative pre-trained models, such as BART [37] and T5 [38], for relation extraction. TANL [39] treated the task as a translation problem between augmented natural languages, while CGT [40] and REBEL [41] framed triple extraction as a sequence-generation task. UIE [42] introduced a unified text-to-structure generation framework that dynamically generates target extractions via a schema-based prompt mechanism. Our approach is similar to these methods in that we use generative pre-trained models, but it distinguishes itself by utilizing a multi-turn QA method to generate entity–relations with hierarchical dependencies.

Pre-trained language models

Recently, pre-trained language models have achieved great success in the field of NLP. Vaswani et al. [43] proposed the self-attention-based Transformer architecture, which soon became the backbone of many subsequent language models pre-trained on large-scale corpora. Existing pre-trained language models can be categorized into three types. First, autoencoding models [44, 45] learn a bidirectional contextualized encoder for natural language understanding (NLU) via denoising objectives; they are suitable for NLU tasks but cannot be directly applied to text generation. Second, autoregressive models are trained with a left-to-right language-modeling objective [46, 47]; they perform well in unconditional generation and can also be applied to NLU and conditional generation tasks. Third, encoder–decoder models are pre-trained for sequence-to-sequence tasks [37, 38]; they are typically deployed in conditional text generation and can also be applied to NLU and unconditional generation tasks. However, none of these pre-training frameworks suits all NLP tasks. Recently, Du et al. [18] proposed a general language model based on autoregressive blank filling to address this challenge.

Contrastive learning

Contrastive learning, in self-supervised, semi-supervised, and supervised settings, has been widely used to learn representations by contrasting positive pairs against negative pairs, especially in computer vision [48]. Recently, contrastive learning has been used in NLP to train language models with self-supervision [49], learn sentence representations [50], and improve machine translation [51]. A paradigm that introduces a self-supervised contrastive loss when fine-tuning a pre-trained model has been applied to conditional text-generation tasks [52], including machine translation, question generation, and summarization. Inspired by these studies, we incorporate contrastive learning into generative QA.

Preliminaries

This section formally defines the relation extraction task and then introduces our generative multi-turn QA technique with contrastive learning based on the GLM pre-trained language model. Figure 3 describes the architecture of GenRE, which consists of two stages: (1) the training stage, in which we train the generative QA model based on GLM while utilizing contrastive learning; and (2) the inference stage, in which we generate answers with the generative QA model in a multi-turn manner and then integrate them into a structured knowledge base.

Fig. 3

Architecture of our approach. The left modules represent the training process based on GLM with contrastive learning and the right modules represent the inference phase in multi-turn QA

Problem formalization

Before presenting our model, we first formalize the entity–relation extraction task. Formally, given a context \(C=\{c_1,c_2,\ldots ,c_n\}\) with n tokens, let E denote the set of entities included in C. An entity \(e_i\in E\) consists of one or multiple consecutive words in the context. Let \(r_{i,j}\) denote one type of relationship between the entity pair \((e_i,e_j)\), with \(r_{i,j}\in R\), where \(R=\{r_1,r_2,\ldots ,r_m\}\). Hence, the traditional RE task is to extract the explicit and implicit relation triples \(T=\{(e_i,r_{i,j},e_j)\}\) between entity pairs \((e_i,e_j)\). However, the relation \(r_{i,j}\) may depend on another entity \(e_k\) owing to hierarchical dependencies within one context.

Table 1 Template-based question generation for RE in different scenarios

Inspired by MRC-based RE, we treat RE as a multi-turn QA problem and solve the task with a generative model trained with contrastive learning. Specifically, the extraction of explicit and implicit relations is transformed into multi-turn QA. We formally define the generative QA problem as follows. Given a context \(C=\{c_1,c_2,\ldots ,c_n\}\) with n tokens and a question \(Q=\{q_1,q_2,\ldots ,q_m\}\) with m tokens, the model must generate the target answer \(A=\{a_1,a_2,\ldots ,a_l\}\) with l tokens for answerable or unanswerable questions. Note that the string “unknown” is returned when the question is unanswerable.

We summarize how to transform the entity–relation extraction task into a multi-turn generative QA task in Algorithm 1.

Algorithm 1

Transforming the entity–relation extraction task into a multi-turn generative QA task.
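To make the transformation concrete, the following is a minimal Python sketch of the multi-turn procedure summarized in Algorithm 1. The function `generate_answer` stands in for the fine-tuned GLM-based QA model, and the question wording and relation names are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of multi-turn generative QA for relation extraction.
# `generate_answer(context, question)` is a placeholder for the fine-tuned
# GLM-based QA model; question wording and relation names are illustrative.

def extract_triples(context, generate_answer):
    triples = []

    # Turn 1: ask for the candidate head entities.
    heads_answer = generate_answer(context, "Who is described in the context?")
    if heads_answer == "unknown":
        return triples
    heads = [h.strip() for h in heads_answer.split(";")]

    # Turn 2: for each head entity, ask one question per relation type.
    relation_questions = {
        "educated_at": "Where was {head} educated?",
        "degree": "What degree did {head} obtain?",
    }
    for head in heads:
        for relation, template in relation_questions.items():
            answer = generate_answer(context, template.format(head=head))
            if answer != "unknown":  # skip unanswerable questions
                triples.append((head, relation, answer))
    return triples
```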

Template-based question prompt generation

For each turn of QA, we need to generate an appropriate question q. To make it easy for GenRE to move from one domain to another, we designed a question generation template that is not only effective for the task but also simple and efficient to implement.

For the normal relation extraction task, we transform it into two turns of QA. Specifically, the first turn identifies all candidate entities (entity tagger), while the second turn classifies the relation types between each possible entity pair (relation extractor). However, in some specific scenarios, RE must be treated as a multi-turn QA task. In this case, the head entities with the highest priority are extracted in the first-turn QA and then fed to the question template to generate the second-turn questions, which obtain tail entities and relations, and so on, until we generate answers in a one-to-one correspondence.

Table 1 presents the question generation templates for different scenarios; a simplified sketch is given below. It is worth noting that each question may have multiple answers, which are needed to iteratively generate the next-turn questions.
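For the scholar-profiling scenario, a template set might be organized as follows; the template wording is ours and only mirrors the multi-turn structure of Table 1, it is not a reproduction of the actual templates.

```python
# Hypothetical question templates mirroring the multi-turn structure of Table 1.
# First-turn questions depend only on the context; second-turn questions are
# filled with head entities answered in the first turn, so dependent tags
# (degree, major, dates) are tied to the university they belong to.

TURN1_TEMPLATE = "Which universities did {person} attend?"

TURN2_TEMPLATES = {
    "degree":     "What degree did {person} obtain at {university}?",
    "major":      "What major did {person} study at {university}?",
    "start_date": "When did {person} enter {university}?",
    "end_date":   "When did {person} leave {university}?",
}

def build_turn2_questions(person, university):
    """Create the second-turn questions once a head entity has been extracted."""
    return {tag: tpl.format(person=person, university=university)
            for tag, tpl in TURN2_TEMPLATES.items()}
```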

Generative QA model with contrastive learning

In a generative QA task, the goal is to generate answers by autoregressively predicting tokens, where every answer has its corresponding rationale. The rationale can usually be located in a certain continuous area of the context or inferred from semantic reasoning based on the context. In this paper, we consider explicit and implicit answers to obtain complete entity–relations.

Inspired by the current trend of formulating NLP tasks as generation tasks [18], we propose a generative QA model based on GLM with contrastive learning to obtain more complete entity–relations. Figure 3(1) illustrates the training process of our model. The GLM is a general language model pre-trained with an autoregressive blank-filling objective. It has shown performance superior to state-of-the-art models in various NLP tasks, such as NLU, unconditional generation, and conditional generation. We use the pre-trained GLM as the main structure of the generative QA model.

Learning representations

Suppose we are given a context \(C=\{c_1,\ldots ,c_n\}\) and a question \(Q=\{q_1,\ldots ,q_m\}\) forming the source text \(x=\{x_1,\ldots ,x_{n+m}\}\), and an answer \(y=\{y_1,\ldots ,y_o\}\) as the target output text, where n, m, and o indicate the numbers of words in the context, question, and answer, respectively. To align with the GLM framework, we must convert the input and target text into word tokens.

Fig. 4

Input embedding for fine-tuning GLM

To achieve this, we introduce the prompt tokens “\(<\textit{Context}>\)”, “\(<\textit{Question}>\)”, “\(<\textit{Answer}>\)” at the beginning of sequence representations of context, question, and answer, respectively. Figure 4 shows an illustrative example of single-turn QA and its representation of the source and target text. For each token, the input embedding comprises token embedding \(E_{T}\), position embedding \(E_{P}\), and block position embedding \(E_{B}\).
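As a concrete illustration, the source and target sequences for a single turn could be assembled as follows; the literal prompt strings and the [MASK] placeholder are written out for readability, and the real tokenizer and special-token handling may differ.

```python
# Sketch of building one (source, target) pair for GLM-style blank infilling.
# The prompt tokens <Context>, <Question>, <Answer> precede each segment, and
# the answer position is left as a masked span to be filled autoregressively.

def build_glm_example(context, question, answer):
    source = f"<Context> {context} <Question> {question} <Answer> [MASK]"
    target = answer if answer else "unknown"   # unanswerable -> "unknown"
    return source, target

src, tgt = build_glm_example(
    "Robert M. Metcalfe received his PhD from Harvard University.",
    "Where did Robert M. Metcalfe obtain his PhD?",
    "Harvard University",
)
```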

Fine-tuning GLM for answer generation

Taking the above learned representation tokens as input, the GLM first computes the hidden states of the input via L stacked Transformer layers, where each layer consists of a multi-head self-attention sublayer and a fully connected feedforward network

$$\begin{aligned} h_{0} = E_{T}+E_{P}+E_{B} \end{aligned}$$
(1)
$$\begin{aligned} h_{l} = \text {transformer}(h_{l-1}), \quad \forall l \in [1,L]. \end{aligned}$$
(2)

In contrast to other models, the number of missing tokens in a span is unknown to the model, and a span may contain multiple tokens. Therefore, the GLM predicts the answer indicated by [MASK] in an autoregressive manner and generates the tokens of the masked span in left-to-right order. For a source sequence x, the probability of generating the target [MASK] span y is expressed as follows:

$$\begin{aligned} P_{\theta }(y \mid x) = \prod _{j=1}^{o} P(y_{j} \mid x, y_{z_{<j}}) \end{aligned}$$
(3)
$$\begin{aligned} P(y_{j} \mid x, y_{z_{<j}}) = \text {softmax}(W h_{L}), \end{aligned}$$
(4)

where \(y_{z_{<j}}\) denotes the first \(j-1\) elements of a permutation \(z \in Z_{T}\), and \(Z_{T}\) is the set of all possible permutations of the length-T index sequence \([1,2,\ldots ,T]\). \(y_j\) corresponds to the jth token of y, and \(h_L\) represents the Lth (final) hidden state.

Contrastive learning for faithful answers

Generative models have become increasingly popular in natural language processing, particularly in text generation. However, a major challenge for these models is that they are generally trained with teacher forcing, which provides ground-truth answers at each time step during training. In other words, the generative model is not exposed to negative examples and may not learn to distinguish between correct and incorrect answers. This issue is commonly referred to as the “exposure bias” problem.

In the QA task, exposure bias can result in incorrect or insufficient answers. For example, given the input sentence “Chris was educated at Plymouth University in the UK and holds an honor degree in Geography.” and the question “Where was Chris educated?”, we expect the model to generate the answer “Plymouth University” rather than “UK”, which is also correct but not what we expect. Therefore, we need a way to enhance the faithfulness and accuracy of the generated answers.

To address this problem, contrastive learning has emerged as a promising solution. This approach mitigates the exposure bias problem by increasing the distance between positive and negative samples. In light of this, we introduce a naïve contrastive learning framework to train the GLM model, which incorporates in-batch non-target sequences as negative examples, as illustrated in Fig. 3(1).

Our approach considers the answer \(y^{(i)}\) as the positive instance and adopts in-batch sampling to obtain the negative instances \(y^{(j)},\ \forall j\ne i\). First, we train the GLM model to learn a joint embedding space. Then, we apply average pooling to \(X^{(i)}\) and \(Y^{(i)}\) (and project them into a two-dimensional space using t-SNE for visualization). Our objective is to maximize the similarity between each source and its true target sentence embedding among the N pairs in the batch, while minimizing the similarity between the source and the remaining \(N^{2}-N\) negative pairs. We use a cross-entropy loss to optimize the similarity scores, as shown in the following loss function:

$$\begin{aligned} {\mathcal {L}}_{cl} = -\sum _{i=1}^{N}\log \frac{\exp (\text {sim}(z_{x}^{(i)},z_{y}^{(i)})/\tau )}{\sum _{z_{y}^{(j)}\in D}\exp (\text {sim}(z_{x}^{(i)},z_{y}^{(j)})/\tau )} \end{aligned}$$
(5)
$$\begin{aligned} z_{x}^{(i)} = \text {AvgPool}(X^{(i)}),\quad X^{(i)}=[X^{(i)}_{1},\ldots ,X^{(i)}_{S}] \end{aligned}$$
(6)
$$\begin{aligned} z_{y}^{(i)} = \text {AvgPool}(Y^{(i)}),\quad Y^{(i)}=[Y^{(i)}_{1},\ldots ,Y^{(i)}_{T}], \end{aligned}$$
(7)

where \(z_*^{(i)}\in {\mathbb {R}}^d\) denotes the fixed-size sentence representation obtained by mean pooling, and \(X^{(i)}\), \(Y^{(i)}\) denote the concatenations of the hidden states of the source text x and target text y, respectively. \(\text {sim}(\cdot ,\cdot )\) calculates the cosine similarity between two representations. Furthermore, \(D=\{z_y^{(j)}:j\ne i\}\) is the set of hidden representations of in-batch sampled negative targets that are not paired with the source text \(x^{(i)}\), and \(\tau \) is the temperature, set to 1.0.
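A compact PyTorch-style sketch of this in-batch contrastive objective is given below; it assumes padded positions have already been handled before pooling, and the argument names (`src_hidden`, `tgt_hidden`) are our own choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(src_hidden, tgt_hidden, tau=1.0):
    """In-batch contrastive loss over pooled source/target representations.

    src_hidden: (N, S, d) hidden states of the N source sequences in a batch
    tgt_hidden: (N, T, d) hidden states of the N target (answer) sequences
    """
    # Average pooling over tokens -> fixed-size sentence vectors (cf. Eqs. 6-7).
    z_x = F.normalize(src_hidden.mean(dim=1), dim=-1)   # (N, d)
    z_y = F.normalize(tgt_hidden.mean(dim=1), dim=-1)   # (N, d)

    # Cosine similarity between every source and every target in the batch.
    sim = z_x @ z_y.t() / tau                            # (N, N)

    # The i-th target is the positive for the i-th source; the remaining
    # in-batch targets act as negatives, following the form of Eq. (5).
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```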

Joint generative and contrastive training

In the fine-tuning stage, given a training dataset, we formulate the cross-entropy loss function of this QA task as follows:

$$\begin{aligned} {\mathcal {L}}_{ce} = -\sum _{i=1}^{o} \log P_{\theta }(y_{i} \mid x, y_{<i}). \end{aligned}$$
(8)

Therefore, the model is trained by jointly minimizing the generation cross-entropy loss and the contrastive loss

$$\begin{aligned} loss={\mathcal {L}}_{ce}+\lambda {\mathcal {L}}_{cl}, \end{aligned}$$
(9)

where \(\lambda \) is a hyperparameter that controls the weight of the contrastive objective. During training, we apply a linear decay schedule to \(\lambda \), so that the model relies more on contrastive learning to generate faithful answers at the early stage and focuses on the target generation task later.
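One possible realization of the joint objective with a linearly decayed \(\lambda \) is sketched below; the initial value of 0.6 is our own assumption, motivated only by the best-performing \(\lambda \) reported in the experiments.

```python
def contrastive_weight(step, total_steps, lambda_init=0.6):
    """Linearly decay the contrastive weight over training (assumed schedule)."""
    return lambda_init * max(0.0, 1.0 - step / total_steps)

def joint_loss(ce_loss, cl_loss, step, total_steps):
    """Total loss of Eq. (9): generation loss plus weighted contrastive loss."""
    return ce_loss + contrastive_weight(step, total_steps) * cl_loss
```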

Constrained decoding for answer generation

The quality of the text generated by autoregressive language models is influenced by several factors, one of which is the decoding strategy. The decoding strategy selects the next token to generate based on a probability distribution over the entire vocabulary. In this study, we explore three decoding strategies under constrained generation.

In greedy decoding, the decoder generates, at each step, the token with the highest probability given all previous tokens. In beam-search decoding, the decoder keeps track of the num_beam most probable sequences at each step, where num_beam is the number of candidate completion sequences to track. In sampling decoding, the decoder samples from either the top_k most probable tokens or the smallest set of tokens whose cumulative probability exceeds top_p.

In the decoding strategies mentioned above, the decoder generates tokens from the entire vocabulary. However, for explicit entity–relation extraction, only the answers present in the context are considered valid. By limiting the generation space to valid tokens, we can prevent the model from generating invalid tokens, which ultimately improves accuracy. Therefore, we experimented with two kinds of constrained generation, namely context-constrained generation and vocabulary-constrained generation, using different decoding strategies.
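The following is a minimal sketch of context-constrained greedy decoding, where the candidate set at every step is restricted to tokens occurring in the context plus the “unknown” answer; `next_token_logits` and the tokenizer attributes are assumed interfaces, not an actual API.

```python
# Sketch of context-constrained greedy decoding. At each step, only tokens
# that appear in the context (plus "unknown" and the end-of-sequence token)
# may be generated. `next_token_logits` is an assumed model interface.

def constrained_greedy_decode(next_token_logits, tokenizer, source, context,
                              max_len=32):
    allowed = set(tokenizer.encode(context)) | set(tokenizer.encode("unknown"))
    allowed.add(tokenizer.eos_token_id)

    generated = []
    for _ in range(max_len):
        logits = next_token_logits(source, generated)   # scores over vocabulary
        # Pick the highest-scoring token among the allowed ones only.
        best = max(allowed, key=lambda tok: logits[tok])
        if best == tokenizer.eos_token_id:
            break
        generated.append(best)
    return tokenizer.decode(generated)
```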

Experimental setups

Dataset

To fairly evaluate the performance of the proposed model, we conduct experiments on two tasks: the traditional relation extraction task and a domain-specific structured prediction task, education information extraction. For the former, we leverage two popular public relation extraction datasets known for containing overlapping relations. For the latter, a more complex custom dataset is employed, incorporating instances of both overlapping and dependency relations.

Relation extraction

To evaluate overlapping relation extraction, we conducted experiments on two popular datasets, NYT [53] and WebNLG [54], for the proposed framework and all baselines. NYT was constructed using distant supervision and is widely used for relation extraction. It contains 24 relations, with 56,195 sentences for training, 5000 for validation, and 5000 for testing. WebNLG was originally created for natural language generation but was later used by Zeng et al. [9] for triple extraction. It contains 171 relations and 5019/500/703 sentences for training, validation, and testing, respectively. Note that both NYT and WebNLG have two different versions according to two annotation criteria: (1) annotating only the last token of an entity and (2) annotating the span of the entire entity. We evaluated our model on both versions of these datasets for a fair comparison. The first versions are denoted as \(\hbox {NYT}^*\) and \(\hbox {WebNLG}^*\), and the second versions are denoted as NYT and WebNLG, respectively.

Education information extraction

For scholar profiling tasks, current entity–relation extraction methods cannot extract more specific and complex entity–relations from text. To support this task, we constructed a Profiling-Edu dataset from the Aminer system, which contains more than 300 million scholars. We sampled the education records of 2053 scholars to annotate start-date, end-date, university, degree, and major. The dataset was randomly split into a testing set of 500 samples and a training set for the remaining samples.

Note that we treat each task as a generative multi-turn QA task, and the datasets used for relation extraction cannot be directly used in QA-based models. Hence, some preprocessing is performed to construct QA pairs from the given context. Concretely, following the question generation method outlined in the “Template-based question prompt generation” section, we obtain a relation extraction dataset that can be used for QA. Our approach differs from other methods that tag the start and end positions of the answer. Instead, our model directly produces the answer itself. In cases where the answer is unknown or not present in the given context, we denote it as “unknown”.
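A simplified preprocessing step might look like the following, where unannotated tags are mapped to the literal answer “unknown” rather than dropped; the field and variable names are illustrative.

```python
# Sketch of converting one annotated education record into QA training pairs.
# Tags without an annotated value become the answer "unknown", so the model
# also learns to abstain on unanswerable questions.

def record_to_qa_pairs(context, record, questions):
    """questions: mapping from tag (e.g. 'degree') to a fully formed question."""
    pairs = []
    for tag, question in questions.items():
        answer = record.get(tag) or "unknown"
        pairs.append((context, question, answer))
    return pairs
```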

Evaluation metric

For traditional relation extraction, we follow the popular choice and report micro-F1 scores, precision, and recall on entities and relations for evaluation.

In the domain-specific relation extraction task, the Profiling-Edu dataset contains multiple items that require a one-to-one correspondence between extracted and annotated items. Therefore, we adopt adjusted precision, recall, and F1-scores to evaluate the results of education information extraction

$$\begin{aligned} P = \frac{\sum _{j=1}^{m}\sum _{i=1}^{k} \frac{x_{i}}{k}}{m} \end{aligned}$$
(10)
$$\begin{aligned} R = \frac{\sum _{j=1}^{m}\sum _{i=1}^{k} \frac{x_{i}}{k}}{n} \end{aligned}$$
(11)
$$\begin{aligned} F1 = \frac{2\times P\times R}{P+R}, \end{aligned}$$
(12)

where m is the number of extracted education records, n is the number of ground-truth education records, and k denotes the number of items in a record. If an item is consistent with the annotated item, \(x_i\) takes the value 1; otherwise, it is 0.
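Under the simplifying assumption that each extracted record is already aligned with its gold counterpart, the adjusted metrics of Eqs. (10)–(12) can be computed as follows; the record alignment itself is outside this sketch.

```python
def adjusted_prf(extracted, gold):
    """Adjusted precision/recall/F1 over education records (Eqs. 10-12).

    extracted, gold: lists of dicts with the same item keys
    (start-date, end-date, university, degree, major).
    """
    m, n = len(extracted), len(gold)
    credit = 0.0
    for pred, ref in zip(extracted, gold):
        k = len(ref)                                   # items per record
        matched = sum(1 for key in ref if pred.get(key) == ref[key])
        credit += matched / k                          # partial credit x_i / k
    precision = credit / m if m else 0.0
    recall = credit / n if n else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```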

Implementation details

We adapted GLM-Doc (24 layers, 1024 hidden units, and 16 attention heads) as the MRC backbone. We optimized our model with label smoothing and the AdamW optimizer with \(\beta _1=0.9\) and \(\beta _2=0.999\). The batch size was 64 and the learning rate was 2e-5, with a weight decay of 1e-1. We applied a linear warm-up learning rate scheduler with a warm-up ratio of 0.06 and trained the model for a maximum of 50 epochs. All experiments were performed on an Intel(R) Xeon(R) Gold 6240 CPU and an NVIDIA V100 32 GB GPU.
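One way to reproduce this optimization setup is sketched below; the Hugging Face scheduler helper is one possible choice rather than the exact implementation used.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr=2e-5, weight_decay=0.1,
                    warmup_ratio=0.06):
    """AdamW with the reported hyperparameters and a linear warm-up schedule."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```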

Comparison methods

To demonstrate its effectiveness, we compared our method with several baselines, which can be grouped into two categories: extractive methods and generative methods.

  1. Extractive methods

    • NovelTagging [8] introduced a novel tagging scheme and modeled the relational triple extraction problem as a sequence-labeling problem.

    • CopyRE [9] adapted a seq2seq model with a copy mechanism that can effectively extract overlapping triples in a sentence.

    • GraphRel [30] is a two-stage model based on a graph convolutional network (GCN) for jointly learning named entities and relations.

    • RSAN [23] proposed a sequence-labeling approach that utilizes a relation-specific attention mechanism.

    • CasRel [10] employed a cascade binary tagging framework, which first extracts all possible head entities in a sentence with span-based MRC. Then, for each head entity, all possible relations and corresponding tail entities are identified.

    • TPLinker [24] iterates over all token pairs and uses matrices to tag token links to recognize the relations between token pairs.

    • RIFRE [31] proposed the construction of heterogeneous graphs for iterative representation fusion by treating relations as nodes on the graph and applying them to relation extraction tasks.

    • TransRel [55] proposed a novel translation-based framework, which contains an entity tagger and a relation extractor.

  2. Generative methods

    • TANL [39] frames structured prediction tasks as translation between augmented natural languages, which makes it easy to encode structured information in the input and to decode the output text into structured information.

    • CGT [40] treats triple extraction as a sequence-generation task and employs contrastive training to generate faithful triplets.

    • REBEL [41] frames triple extraction as a seq2seq task and leverages BART as the base model.

    • UIE [42] introduced a unified text-to-structure generation framework that adaptively generates target extractions via a schema-based prompt mechanism. For a fair comparison, we fine-tune UIE using T5-v1.1-large as the backbone in the following experiments.

Table 2 Main results on WEBNLG and NYT(%)

Results and discussion

In this section, we present the experimental results on the WebNLG and NYT datasets for the relation extraction task and on the Profiling-Edu dataset for education information extraction. In addition, we analyze and discuss the performance of our model in detail.

Main result

Relation extraction

Table 2 presents the performance of all the baselines and our model on the WebNLG and NYT datasets. According to the results, our model achieves performance competitive with the best baselines across all evaluation metrics and datasets. Whether a model is generative matters here: generative models achieve much better results on fully annotated datasets and slightly inferior but still competitive results on partially matched datasets. The reason may be that generative models are better at preserving semantic integrity and are less prone to the ambiguity of incomplete annotations. For example, given the instance “Joe Buck’s father is Jack Buck.”, the exactly matched triple is (Jack Buck, Children, Joe Buck), whereas the partially matched triple is (Buck, Children, Buck). Thus, a generative model may not know which Buck is the head entity because of the ambiguity of incomplete annotations. This is meaningful, because it indicates that generative models perform well when deployed in real scenarios.

Compared with UIE [42], a model that frames all IE tasks as text-to-structure transformations, our model achieves an absolute improvement of 1.8 points in precision on the NYT dataset. UIE can handle overlapping triples and is flexible, but it cannot deal with redundant predictions. In contrast, our framework can effectively reduce redundant predictions through the candidate answer “unknown”. Moreover, our model performs much better than other generative models on the NYT* dataset, because it transforms relation extraction into a multi-turn QA task.

Education information extraction

Considering that our Profiling-Edu dataset contains cases with hierarchical dependencies between entities, we also conducted experiments on it and report the results in Table 3. In contrast to the datasets above, Profiling-Edu has five types of entities and requires multi-turn QA, as illustrated in Table 1. We further compared our model with four baselines: MQARE [17], TANL [39], REBEL [41], and UIE [42]. Specifically, MQARE casts entity–relation extraction as an MRC task and utilizes BERT as the backbone. TANL is a seq2seq model that frames structured prediction tasks as translation between augmented natural languages. According to the results, our model and MQARE significantly outperform TANL, because the MRC-style methods can fully model the rich interactions between entities and relations and generalize to new scenarios. Compared with MQARE, our model still achieves competitive performance, because our generative method employs the special candidate answer “unknown”, which bridges the gap between no answer and a wrong answer without affecting the answer of the next turn. Although REBEL and UIE achieve high performance on the custom dataset, our approach GenRE performs much better than both. These results verify the effectiveness of the proposed model.

Table 3 Results on the custom dataset (%)
Table 4 Ablation study (%)
Table 5 Test GenRE with three different decoding strategies (%)

Detailed analysis and discussion

Ablation study

To better understand the effectiveness of our model, we conducted ablation studies by removing the key modules individually. As shown in Table 4, there is an obvious performance gap when removing question prompts, with an average drop of 4.54% in F1, indicating that question prompt generation plays an important role in generating high-quality answers. To investigate the effect of our learning objective, we further ablated the contrastive learning loss. The model performance inevitably decreases, demonstrating that contrastive learning helps boost performance compared to the corresponding model versions without the additional objective.

Fig. 5

F1 scores for different values of \(\lambda \) on the different datasets

Effect of contrastive loss margin

Next, we investigate the effect of the contrastive learning loss margin \(\lambda \) in our framework, which controls the balance between the generation loss and contrastive learning loss. To this end, we fine-tuned the GLM by varying the value of \(\lambda \) from 0.1 to 1.0 and measured the evaluation metric. The results are shown in Fig. 5. When \(\lambda \) is set to 0, it refers to the model with only a cross-entropy loss. Interestingly, contrastive learning always helps boost performance compared to the corresponding model versions without the additional objective. This indicates that the proposed model can generate semantically valid answers that are beneficial for training the QA model. In addition, we observed that the best performance was achieved when \(\lambda =0.6\) for the relation extraction task.

Effect of constrained decoding

In this section, we provide a more in-depth comparison between context-constrained generation and vocabulary-constrained generation under different decoding strategies, including greedy decoding, beam search, top-k, and top-p. Table 5 presents the F1-scores of GenRE with each decoding strategy on the Profiling-Edu dataset, as discussed in the “Constrained decoding for answer generation” section. Intuitively, the differences between the decoding strategies are marginal in the context-constrained generation case. This highlights that context-constrained generation is more suitable for explicit entity–relation extraction. Additionally, we observed that beam-search decoding brings more improvement than the other strategies in either case and can effectively guide answer generation. This suggests that it is suitable for answer generation, particularly for implicit entity–relation extraction.

Table 6 Inference comparisons of different models (%)

Inference of GenRE

One straightforward solution for RE is to extract explicit entity relations in context. However, in real-life scenarios, a sentence may contain implicit entity–relations. To verify that our model has the ability to reason, we further conducted detailed experiments on the Profiling-Edu dataset, which adds implicit entity–relations on start-date and end-date. Results are provided in Table 6. We observe that GenRE performs the best among the models, and our GenRE model outperforms REBEL and UIE by 16.3% and 2.3%, respectively, in the F1 evaluation metric with implicit entity relations. This shows that the generated answers are strongly correlated with rationales, demonstrating the inference effectiveness of leveraging the GenRE model.

Normalization of GenRE

For information retrieval or question answering, named entities are expected to be normalized, which refers to the process of mapping different names that refer to the same entity to a canonical form. For example, MIT refers to the Massachusetts Institute of Technology and should be normalized. However, traditional BERT-based MRC models extract entities by predicting start and end indexes, which cannot directly normalize the extracted entities and requires further operations. In contrast, the proposed generative model can normalize the generated entities, providing a more direct solution to this issue.

To verify the effectiveness of normalization, we conducted a detailed analysis on the Profiling-Edu dataset, in which the university tag is normalized, as in \(<\textit{person}, \textit{edu\_at}, \textit{univ}>\). Table 7 presents the results of this study. As can be seen, the normalization ability of REBEL and UIE is weak: they only obtain F1-scores of 74.0% and 73.0%, respectively. This is in line with our expectations, since they cannot adequately capture the affinity between context and entities. The generative MRC model, which generates the answer to a given question, has stronger generalization and normalization capabilities, resulting in acceptable results.

Table 7 Normalization comparisons of different models (%)

Conclusion

In this study, we cast the RE task as a generative MRC task with contrastive learning. Specifically, we propose an effective generative MRC framework that generates entity–relations through multi-turn QA, together with a contrastive learning algorithm for efficient model learning. The experimental results show that our model achieves performance competitive with previous SOTA models using only coarse annotation. Based on our findings, we believe that generative modeling is highly promising and capable of implicit entity–relation inference and entity normalization.

In the future, we will explore using reinforcement learning to reward and penalize error accumulation in multi-turn QA, which might improve extraction performance on the recall metric. We will also consider extending our approach to more information-extraction tasks, such as event extraction.