
1 Introduction

The Transformer has recently drawn great attention in natural language processing because of its superior capability in capturing long-range dependencies [1]. Extracting entity pairs and the relations between them from unstructured text is an essential step in the automatic construction of knowledge bases. Joint extraction of entities and all possible relations between them at once considers the potential interaction between the two subtasks and eliminates the error propagation issue of the traditional pipeline approach [2, 3]. A typical joint extraction scheme is ETL-Span [4], which transforms information extraction into a sequence labelling problem with multi-part labels. It also proposes a novel decomposition strategy that hierarchically decomposes the task into several simpler sequence labelling problems. The key idea is to first identify, from the beginning of the sentence, all candidate head entities that may be related to the target relation, and then mark the corresponding tail entity and relation for each extracted head entity. This method achieves excellent performance in overlapping entity extraction.

Despite the efficiency of this framework, its feature representation is limited compared with more complex models, especially the transformer-based encoder BERT [5]. Using BERT to encode the input sentence provides a shared feature representation enriched with high-level semantic information. However, the Transformer [6] network structure is a stack of self-attention layers and is inherently unable to learn the sequential order of a sentence. The position and order of words in text are important features, and they directly affect the accuracy of information extraction tasks in which the targets are determined by boundaries.

To address the aforementioned limitations, we present our cRPE-Span model, which makes the following contributions:

  1. The shared embedding module is improved with BERT, and a complex-field relative position encoding is added to represent the relative position information between entities, so that the extractor can take both the semantic and the positional information of the given entity into account when marking the tail entity and relation.

  2. The hierarchical boundary marker only marks the entity start and end positions in a cascade structure and ignores the entity category, which reduces the task difficulty to a single-step prediction process and thus alleviates the accumulated error.

  3. Our method achieves consistently better performance on three benchmark datasets for joint entity and relation extraction, obtaining an F1 score of 93.6% on the NYT-multi dataset.

2 Related Works

The entity-relation extraction task has long received wide attention for its crucial role in information extraction. Because most traditional methods ignore the interaction between entity recognition and relation extraction, researchers have proposed a variety of joint learning methods with end-to-end neural architectures [4, 7,8,9]. Unfortunately, due to the limitations of a shared encoder, these methods cannot fully exploit the inter-dependency between entities and relations.

Introducing the powerful transformer-based BERT to encode the input enhances the capability of modeling the relationships of tokens in a sequence. The core of the Transformer is self-attention; however, self-attention has an inherent deficiency in that it does not capture the sequential order of the input tokens, so positional representations must be added to encode this information explicitly. The approaches for positional representation in transformer-based networks fall into two categories. The first is absolute position encoding, which injects positional information into the model by encoding the positions of input tokens from 1 to the maximum sequence length; typical examples are the sinusoidal position encoding in the Transformer and the learned position encoding in BERT and GPT [10]. However, such absolute positions cannot explicitly model the interaction between any two input tokens. The second category, relative position encoding (RPE), extends the self-attention mechanism to consider the relative positions or distances between sequence elements, as in NEZHA [11], Transformer-XL [1], T5 [12] and DeBERTa [13]. However, such information is not necessary for non-entity tokens and may, on the contrary, introduce noise. Different from the relative position encodings mentioned above, we introduce complex relative position encoding (cRPE) into BERT for joint entity and relation extraction.

3 Method

The cRPE-Span joint extraction structure is an end-to-end neural architecture that jointly extracts entities and overlapping relations. We first add the cRPE to the powerful transformer-based BERT and then use it to encode the input for a more accurate representation of the relative position information between entities. In the joint extraction structure, we use a span-based tagging scheme together with a reasonable decomposition strategy. In essence, the framework reduces the influence of redundant entity pairs and captures the correlation between the head entity and the tail entity, thus obtaining better joint extraction performance. Figure 1 shows the framework diagram of our cRPE-Span extraction system.

Fig. 1. Framework diagram of our cRPE-Span extraction system

3.1 Shared Feature Representation Module

Intuitively, the distance between entities and other context tokens provides important evidence for entity and relation extraction. We therefore inject location information into the network by adding position encodings to the input token embeddings. The Transformer generally uses absolute positional encoding in the form of sine and cosine functions, which ensures that each position vector is unique and that there is a relationship between different positions. However, Yan et al. [14] found that this trigonometric position information, commonly used in the Transformer, loses its relative relationship during forward propagation. Similarly, the embedding vectors of different positions have no obvious constraint relationship in the transformer-based BERT. Because the embedding vector of each position is trained independently in BERT, it can only model absolute position information and cannot model the relative relationships between different positions (such as adjacency and precedence).

To let the model capture a more accurate relative position relationship, we add the cRPE to the input of BERT in addition to its original learned position embedding. A continuous function over the complex field is adopted to encode the representation of words at different positions. In this paper, the input embedding vector of BERT is the superposition of four embedding features, namely the word-piece token embedding, the segmentation embedding, the learned position embedding and the complex-field position embedding.
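As a concrete illustration (not the authors' released code), the superposition of the four embedding features could be sketched as follows. The module and argument names are hypothetical, and the complex-field position features are assumed to be precomputed as described in the next subsection.

```python
import torch
import torch.nn as nn

class SharedInputEmbedding(nn.Module):
    """Sketch: superimpose the four embedding features before the BERT encoder."""
    def __init__(self, vocab_size, max_len=128, hidden=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)        # word-piece token embedding
        self.segment_emb = nn.Embedding(2, hidden)              # segmentation embedding
        self.learned_pos_emb = nn.Embedding(max_len, hidden)    # BERT's learned absolute positions

    def forward(self, token_ids, segment_ids, complex_pos_emb):
        # complex_pos_emb: precomputed complex-field position features of shape
        # (batch, seq_len, hidden); its construction is sketched in Sect. 3.1.
        pos = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        return (self.word_emb(token_ids) + self.segment_emb(segment_ids)
                + self.learned_pos_emb(pos) + complex_pos_emb)
```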

Relative Position Embedding in Complex Field.

Typically, the relative position between tokens \({{\varvec{x}}}_{i}\) and \({{\varvec{x}}}_{j}\) is encoded into vectors \({{\varvec{p}}}_{ij}^{V}\), \({{\varvec{p}}}_{ij}^{Q}\), \({{\varvec{p}}}_{ij}^{K}\in {\mathbb{R}}^{{d}_{z}}\), and these positional vectors are incorporated into the self-attention module, which is reformulated as

$${{\varvec{z}}}_{i}=\sum\nolimits_{j=1}^{n}{\alpha }_{ij}\left({{\varvec{x}}}_{j}{{\varvec{W}}}^{V}+{{\varvec{p}}}_{ij}^{V}\right)$$
(1)

each weight coefficient \({\alpha }_{ij}\) is computed using a softmax:

$${\alpha }_{ij}=\frac{\mathrm{exp}\left({e}_{ij}\right)}{\sum_{k=1}^{n}\mathrm{exp}\left({e}_{ik}\right)}$$
(2)

where \({e}_{ij}\) is calculated using a scaled dot-product attention:

$${e}_{ij}=\frac{\left({{\varvec{x}}}_{i}{{\varvec{W}}}^{Q}+{{\varvec{p}}}_{ij}^{Q}\right){\left({{\varvec{x}}}_{j}{{\varvec{W}}}^{K}+{{\varvec{p}}}_{ij}^{K}\right)}^{T}}{\sqrt{{d}_{z}}}.$$
(3)
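A minimal single-head sketch of Eqs. (1)–(3) is given below. It assumes the relative position vectors \(p^{Q}\), \(p^{K}\), \(p^{V}\) are looked up from tables indexed by the clipped distance between tokens, which is a common implementation choice rather than a detail stated in this paper.

```python
import math
import torch
import torch.nn as nn

class RelativeSelfAttention(nn.Module):
    """Single-head sketch of Eqs. (1)-(3): self-attention with relative position terms."""
    def __init__(self, d_model, d_z, max_rel_dist=16):
        super().__init__()
        self.wq = nn.Linear(d_model, d_z, bias=False)
        self.wk = nn.Linear(d_model, d_z, bias=False)
        self.wv = nn.Linear(d_model, d_z, bias=False)
        # Relative-position lookup tables indexed by the clipped distance j - i (assumption).
        self.pq = nn.Embedding(2 * max_rel_dist + 1, d_z)
        self.pk = nn.Embedding(2 * max_rel_dist + 1, d_z)
        self.pv = nn.Embedding(2 * max_rel_dist + 1, d_z)
        self.max_rel_dist = max_rel_dist
        self.d_z = d_z

    def forward(self, x):                                   # x: (batch, n, d_model)
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel = rel + self.max_rel_dist                        # shift to non-negative indices
        pq, pk, pv = self.pq(rel), self.pk(rel), self.pv(rel)   # each (n, n, d_z)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)             # each (batch, n, d_z)
        # Eq. (3): e_ij = (x_i W^Q + p_ij^Q)(x_j W^K + p_ij^K)^T / sqrt(d_z)
        qp = q.unsqueeze(2) + pq.unsqueeze(0)                # (batch, n, n, d_z)
        kp = k.unsqueeze(1) + pk.unsqueeze(0)                # (batch, n, n, d_z)
        e = (qp * kp).sum(-1) / math.sqrt(self.d_z)
        alpha = torch.softmax(e, dim=-1)                     # Eq. (2)
        vp = v.unsqueeze(1) + pv.unsqueeze(0)                # (batch, n, n, d_z)
        z = (alpha.unsqueeze(-1) * vp).sum(dim=2)            # Eq. (1)
        return z
```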

Instead of simply adding the word vector and a fixed position vector, we use a function of position to inject the positional information and model the relative positions of words; this function changes continuously with position. Following the complex-valued relative position representation proposed by Wang et al. [15], we first define a function that describes the word with index j at position pos in the text as:

$$f\left(j,pos\right)={{\varvec{g}}}_{j}(pos)\in {\mathbb{C}}^{D}$$
(4)

\({{\varvec{g}}}_{j}\) is a vector-valued function that satisfies the following two properties:

1. There exists a function \(\mathrm{T}:{\mathbb{N}}\times {\mathbb{C}}^{D}\to {\mathbb{C}}^{D}\) such that for all \(pos\ge 0\) and \(n\ge 0\), \({{\varvec{g}}}_{j}(pos+n)=\mathrm{T}(n,{{\varvec{g}}}_{j}(pos))\). Namely, if we know the word vector representation of a word at a certain position, we can compute its representation at any other position; the transformation depends only on the relative offset n, not on the absolute position.

2. There exists \(\delta \in {\mathbb{R}}_{+}\) such that for all positions pos, \(\Vert {{\varvec{g}}}_{j}(pos)\Vert \le \delta \). That is, the norm of the word vector is bounded.

If \(\mathrm{T}\) is a linear function, then \({{\varvec{g}}}_{j}(pos)\) admits a unique vector-valued solution of the form:

$${{\varvec{r}}}_{j}{e}^{i\left({{\varvec{w}}}_{j}pos + {{\varvec{\theta}}}_{j}\right)}$$
(5)

It can also be written component-wise as:

$$\left[{r}_{j,1}{e}^{i\left({w}_{j,1}pos+{\theta }_{j,1}\right)},{r}_{j,2}{e}^{i\left({w}_{j,2}pos+{\theta }_{j,2}\right)},\dots ,{r}_{j,D}{e}^{i\left({w}_{j,D}pos+{\theta }_{j,D}\right)}\right]$$
(6)

In this way, we model word order in a smooth manner. Here \({{\varvec{r}}}_{j}\) is the amplitude, \({{\varvec{\theta}}}_{j}\) is the initial phase and \({{\varvec{w}}}_{j}\) is the angular frequency. The amplitude depends only on the word index j; it represents the meaning of the word and corresponds to an ordinary word vector. The phase \({{\varvec{w}}}_{j}pos+{{\varvec{\theta}}}_{j}\) is related not only to the word itself but also to the position of the word in the text, so it encodes the word's position. When the angular frequency is small, the word vectors of the same word at different positions are almost constant; in this case the complex-field word vector is insensitive to position, similar to an ordinary word vector that ignores positional information. When the angular frequency is very large, the complex-valued word vector is very sensitive to position and changes dramatically as the position changes.
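The complex-field embedding of Eqs. (5)–(6) can be sketched as follows. This is only an illustrative implementation under our own assumptions: per-word amplitude, frequency and phase tables are learned, and the complex value is returned as concatenated real and imaginary parts so that a real-valued encoder can consume it.

```python
import torch
import torch.nn as nn

class ComplexPositionEmbedding(nn.Module):
    """Sketch of Eq. (6): per-word amplitude r_j, angular frequency w_j, initial phase theta_j."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.amplitude = nn.Embedding(vocab_size, dim)   # r_j: depends only on the word
        self.frequency = nn.Embedding(vocab_size, dim)   # w_j: controls position sensitivity
        self.phase = nn.Embedding(vocab_size, dim)       # theta_j: initial phase

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        pos = torch.arange(token_ids.size(1), device=token_ids.device).float()
        pos = pos.view(1, -1, 1)                          # (1, seq_len, 1)
        r = self.amplitude(token_ids)
        angle = self.frequency(token_ids) * pos + self.phase(token_ids)
        # g_j(pos) = r_j * exp(i (w_j * pos + theta_j)); return real and imaginary parts
        # concatenated, so a real-valued network can consume them (our assumption).
        return torch.cat([r * torch.cos(angle), r * torch.sin(angle)], dim=-1)
```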

3.2 Joint Extraction of Entities and Relations

The joint entity and relation extraction task is transformed into a sequential pointer-tagging problem. First, a hierarchical boundary marker is used to mark the start and end positions in a cascade structure; then a multi-span decoding algorithm jointly decodes the head and tail entities based on the span markers, predicting the indices of the start and end positions to identify the entity boundaries.

Joint Extractor.

The extractor consists of a head entity extractor (HE) and a tail entity and relation extractor (TER). Both HE and TER are decomposed into two sequential tagging subtasks, which identify the entity start and end positions using a pointer network [16]. The difference between HE and TER is that TER also predicts the relations. It is worth noting that entity category information is not involved in this sequential tagging process; that is, the model does not need to predict the entity category first and then predict the relation according to that category, but only needs to predict the relation according to the entity location information. Therefore, the task difficulty is reduced to a single-step prediction process, and the accumulated error is alleviated.

The purpose of the HE extractor is to distinguish candidate entities and exclude irrelevant ones. First, a triple library is constructed from the training set, and the embedding vector sequence \({{\varvec{h}}}_{i}\) is obtained from the embedding module. Then, a remote supervised search over the training data yields the prior information representation vector \({\varvec{p}}\). Finally, the feature vector \({{\varvec{x}}}_{i}=[{{\varvec{h}}}_{i};{\varvec{p}}]\) is obtained by concatenating the encoded feature vector sequence with the prior information representation vector. \({{\varvec{h}}}_{HE}\) (\({{\varvec{h}}}_{HE}=\{{{\varvec{x}}}_{1},\dots ,{{\varvec{x}}}_{n}\}\)) denotes the representation of all words used for HE extraction. \({{\varvec{h}}}_{HE}\) is fed into the HE extractor to extract all head entities in the sentence together with their corresponding location labels.

Similar to the HE extractor, TER also uses the basic representation \({{\varvec{h}}}_{i}\) and the prior information vector \({\varvec{p}}\) as input features. However, the combination of \({{\varvec{h}}}_{i}\) and \({\varvec{p}}\) is insufficient to detect tail entities and relations for a specific head entity. The key information needed for TER extraction includes: (1) the words in the tail entity, (2) the head entity it depends on, and (3) the context expressing the relation. We therefore combine the head entity with the context-related feature representation. That is, given a head entity \(h\), \({{\varvec{x}}}_{i}\) is defined as follows:

$${{\varvec{x}}}_{i}=[{{\varvec{h}}}_{i};{\varvec{p}};{{\varvec{h}}}^{h}]$$
(7)

Here, \({{\varvec{h}}}^{h}=[{{\varvec{h}}}_{sh};{{\varvec{h}}}_{eh}]\), where \({{\varvec{h}}}_{sh}\) and \({{\varvec{h}}}_{eh}\) are the representations at the start and end positions of the head entity \(h\), respectively. \([{\varvec{p}};{{\varvec{h}}}^{h}]\) is the auxiliary feature vector for tail entity and relation extraction. We take \({{\varvec{h}}}_{TER}\) (\({{\varvec{h}}}_{TER}=\{{{\varvec{x}}}_{1},\dots ,{{\varvec{x}}}_{n}\}\)) as the input of the hierarchical boundary annotation, and the output is \(\{(h,{rel}_{o},{t}_{o})\}\), which contains all triples in sentence \(s\) given the head entity \(h\).
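For illustration, the feature construction used by the two extractors (including Eq. (7)) might look like the sketch below, where the token representations, the prior vector and the head-entity boundary indices are assumed to be available; all names are hypothetical.

```python
import torch

def he_features(h, p):
    """HE extractor input: x_i = [h_i ; p] (sketch)."""
    # h: (seq_len, d), p: (d_p,) -> (seq_len, d + d_p)
    return torch.cat([h, p.unsqueeze(0).expand(h.size(0), -1)], dim=-1)

def ter_features(h, p, start_idx, end_idx):
    """TER extractor input, Eq. (7): x_i = [h_i ; p ; h^h] (sketch)."""
    head_repr = torch.cat([h[start_idx], h[end_idx]], dim=-1)   # h^h = [h_sh ; h_eh]
    aux = torch.cat([p, head_repr], dim=-1)                     # [p ; h^h]
    return torch.cat([h, aux.unsqueeze(0).expand(h.size(0), -1)], dim=-1)
```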

In general, for a sentence with m entities, the whole joint decoding process consists of two sequence tagging tasks for HE and 2m sequence tagging tasks for TER.

Loss Function.

During training, we aim to share the input sequence among tasks and perform joint training. For each training instance, instead of feeding the sentence repeatedly to use all of its triple information, we randomly select one head entity from the labeled head entities as the input of the TER extractor. Two loss functions are used to train the model: \({L}_{HE}\) for HE extraction and \({L}_{TER}\) for TER extraction.

$$L={L}_{HE}+{L}_{TER}$$
(8)

This optimization function allows the extraction of the head entity, the tail entity and the relation to interact, so that each element in a triple is constrained by the other elements. \({L}_{HE}\) and \({L}_{TER}\) are defined as the averaged negative log probability of the true start and end tags:

$${L}_{HE,TER}=-\frac{1}{n}{\sum }_{i=1}^{n}\left(logP\left({y}_{i}^{sta}={\widehat{y}}_{i}^{sta}\right)+logP\left({y}_{i}^{end}={\widehat{y}}_{i}^{end}\right)\right)$$
(9)

Here, \({\widehat{y}}_{i}^{sta}\) and \({\widehat{y}}_{i}^{end}\) are the true tags indicating whether the i-th word is the start or end position of a target entity, and \(n\) is the length of the sentence. \({P}_{i}^{sta}\) and \({P}_{i}^{end}\) denote the predicted probabilities that the i-th word is the start and end position of a target entity, respectively.

$${P}_{i}^{sta,end}=sigmoid\left({w}_{sta, end}{x}_{i}+{b}_{sta,end}\right)$$
(10)
$${y}_{i}^{sta,end}={\chi }_{\{{P}_{i}^{sta,end}>{threshold}_{sta,end}\}}$$
(11)

Here, \(\chi \) is an indicator function such that \({\chi }_{A}=1\) if and only if \(A\) is true.
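A hedged sketch of the boundary tagger and loss (Eqs. (8)–(11)) follows; binary cross-entropy is used here as one standard realisation of the negative log probability, and the layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryTagger(nn.Module):
    """Sketch of Eqs. (10)-(11): sigmoid start/end pointers with a decision threshold."""
    def __init__(self, in_dim, threshold=0.5):
        super().__init__()
        self.start_fc = nn.Linear(in_dim, 1)
        self.end_fc = nn.Linear(in_dim, 1)
        self.threshold = threshold

    def forward(self, x):                                       # x: (seq_len, in_dim)
        p_start = torch.sigmoid(self.start_fc(x)).squeeze(-1)   # Eq. (10)
        p_end = torch.sigmoid(self.end_fc(x)).squeeze(-1)
        return p_start, p_end

    def decode(self, p_start, p_end):
        # Eq. (11): indicator of probability exceeding the threshold
        return (p_start > self.threshold).long(), (p_end > self.threshold).long()

def boundary_loss(p_start, p_end, y_start, y_end):
    """Eq. (9): mean negative log probability of the true start/end tags."""
    return (F.binary_cross_entropy(p_start, y_start.float())
            + F.binary_cross_entropy(p_end, y_end.float()))

# Eq. (8): L = L_HE + L_TER, e.g.
# loss = boundary_loss(*he_probs, *he_labels) + boundary_loss(*ter_probs, *ter_labels)
```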

4 Experiments

4.1 Datasets

We conducted experiments on three datasets: (1) CoNLL04, published by Dan et al. [17]; we used the segmented dataset with 5 relation types defined by Gupta and Adel et al. [18, 19], which contains 910 training, 243 validation and 288 test instances. (2) NYT-multi, published by Zeng et al. [20]. To test overlapping relation extraction over 24 relation types, they selected 5000 sentences from NYT-single as the test set, 5000 sentences as the validation set, and the remaining 56195 sentences as the training set. (3) WebNLG, released by Claire et al. [21] for the natural language generation task. We used the WebNLG data preprocessed by Zeng et al. [20], which includes 5019 training, 500 validation and 703 test instances with 246 relation types.

4.2 Experimental Evaluation

We follow the evaluation metrics of previous work [4, 22]. A triple is labeled as correct if and only if its relation type and the two corresponding entities are correct; an entity is considered correct if its head and tail position boundaries are correct. We use the standard micro precision, recall and F1 score to evaluate the results.
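As an illustration of this metric, triple-level micro precision, recall and F1 can be computed as in the following sketch, where predicted and gold triples are exact-match (head entity, relation, tail entity) tuples collected per sentence.

```python
def micro_prf(pred_triples, gold_triples):
    """Micro precision/recall/F1 over exact-match triples (sketch).

    pred_triples, gold_triples: lists of sets, one set of triples per sentence.
    """
    tp = sum(len(p & g) for p, g in zip(pred_triples, gold_triples))
    n_pred = sum(len(p) for p in pred_triples)
    n_gold = sum(len(g) for g in gold_triples)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```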

4.3 Experimental Parameters

We train our model with a mini-batch mechanism using a batch size of 8 and optimize the parameters with the weighted-moving-average Adam optimizer. The learning rate is set to 1e−5, and the stacked bidirectional transformer has 12 layers with a hidden state size of 768. We use the pretrained BERT base model (Uncased-BERT-Base). The maximum length of input sentences is set to 128. We did not tune the threshold of the joint extractor and set it to 0.5 by default. All hyperparameters are tuned on the validation set. In each experiment we use an early stopping mechanism to prevent the model from overfitting, and then report the results of the best model on the test set. All training and testing were performed on a 32 GB Tesla V100 GPU.
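For reference, the reported settings can be summarised as a plain configuration dictionary; the key names are illustrative and not taken from the authors' code.

```python
config = {
    "pretrained_model": "bert-base-uncased",   # Uncased-BERT-Base
    "num_layers": 12,
    "hidden_size": 768,
    "batch_size": 8,
    "learning_rate": 1e-5,
    "optimizer": "Adam",
    "max_seq_length": 128,
    "decode_threshold": 0.5,                   # joint extractor threshold (not tuned)
    "early_stopping": True,
}
```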

5 Results and Analyses

5.1 Comparison Models

We mainly compare our model with the following baseline models: (1) Multi-Head [22] and (2) ETL-Span [4]. We reimplement these models on CoNLL04, NYT-multi and WebNLG datasets, marked with * in Table 1 and Table 2.

Table 1. Comparison of model results on CoNLL04 dataset (%)

Table 1 reports the results of our model against the baseline methods on the CoNLL04 dataset. Our model achieves a comparable result, with an F1 score of 67.6% and a recall of 68.7%. We find that our model outperforms methods based on sequence-to-sequence encoding, such as Biaffine-attention and Multi-Head. This is probably due to the inherent limitation of unrolling an RNN to generate triples.

Table 2 shows that our proposed joint extraction method based on complex position embedding, cRPE-Span, significantly outperforms all other methods, especially on the NYT-multi dataset, with precision, recall and F1 score of 94.6%, 92.5% and 93.6%, respectively.

Table 2. Comparison of model results on NYT-multi and WebNLG datasets (%)

Compared with ETL-Span, a joint extraction method based on a span scheme, the F1 scores of cRPE-Span on the NYT-multi and WebNLG datasets increase by 17.9% and 2.9%, respectively. Compared with Multi-Head, the F1 scores of cRPE-Span on the NYT-multi and WebNLG datasets increase by 14.6% and 5.2%, respectively. We attribute this to two factors: (1) we decompose the difficult joint extraction task into several more manageable subtasks and handle them in a mutually enhancing way, which suggests that our HE extractor and TER extractor indeed reinforce each other; (2) our shared feature extractor based on BERT with cRPE effectively captures the semantic and positional information of the head entity on which the tail entity depends, whereas ETL-Span uses an LSTM for shared encoding and needs to predict the entity category first and then the relation based on that category, which may cause error propagation. Overall, these results demonstrate that our extraction paradigm, which first extracts the head entity and then marks the corresponding tail entity, can better capture the relational information in the sentence.

5.2 Ablation Study

To demonstrate the effectiveness of each component, we conducted ablation experiments by removing one particular component at a time to understand its impact on the performance. We study the influence of cRPE (complex relative positional encoding) and RSS (remote supervised search) on the WebNLG dataset, as shown in Table 3.

From the table we can see that: (1) when we remove the cRPE, the F1 score drops by 1.4%. This shows that relative position encoding plays a vital role in information extraction: it lets the tail entity extractor know the position of a given head entity, so that irrelevant entities can be filtered out through implicit distance constraints. In addition, by predicting the entities in the HE extractor, we can explicitly integrate entity location information into the entity representation, which also helps the subsequent TER tagging; (2) after removing the remote supervised search strategy, the F1 score drops by 0.2%. These comparisons again confirm the effectiveness and rationality of our cRPE and RSS strategies.

Table 3. Comparison of simplified model results (%)

5.3 Model Convergence Analysis

To analyze the convergence of our model, we conducted further experiments on the three test datasets and selected our baseline model RSS-Span for comparison. The RSS-Span model uses the remote supervised search strategy but not the complex relative positional encoding. To distinguish the results of the baseline and the cRPE-Span model, the baseline results are drawn with black hollow circles and the cRPE-Span results with blue solid circles, as shown in Fig. 2. The dashed lines in the figure are benchmark scores, taken as the smaller of the two models' best F1 scores. For the NYT-multi dataset, we select 92.8% as the benchmark F1 score, the smaller of 93.6% (cRPE-Span) and 92.8% (baseline). Similarly, for the CoNLL04 and WebNLG datasets, the selected benchmark F1 scores are 66.1% and 85.7%, respectively. That is, we analyze the number of training epochs needed to reach the benchmark score.

Fig. 2. Comparison results of model convergence

From Fig. 2, we observe that the convergence of cRPE-Span is somewhat slower than that of RSS-Span. RSS-Span reaches the benchmark F1 score after about 100 training epochs, while cRPE-Span needs about 1000 epochs. This is because the cRPE-Span position embedding layer encodes word representations at different positions with a continuous function over the complex domain, which introduces new parameters to be learned, including the amplitude, angular frequency and initial phase. These parameters increase the size of the embedding layer, and consequently iterative training takes longer. In addition, we observe that the performance stability of cRPE-Span is better than that of RSS-Span. A possible reason is that the increased number of parameters gives the model better generalization ability, which further demonstrates the advantage of our complex-field relative position embedding.

6 Conclusion

In this paper, we present an improved joint extraction method for entities and relations based on an end-to-end sequence labeling framework with complex relative position encoding. The framework builds on the shared encoding of a pre-trained language model and a novel decomposition strategy. The experimental results show that the functional decomposition of the original task simplifies the learning process and yields a better overall learning effect. Compared with the baseline models, our method reaches a better level on the three public datasets. Further analysis demonstrates the ability of our model to handle multi-entity and multi-relation extraction. In the future, we hope to explore similar decomposition strategies in other information extraction tasks, such as event extraction and concept extraction.