Introduction

Relational triple extraction is a core task in natural language processing, with applications spanning information extraction, knowledge graph construction [1], question-answering systems [2], and recommendation systems [3]. The task identifies entities and their semantic relations in unstructured text and represents them as triples \(<s,r,o>\), as shown in Fig. 1, where r is drawn from a predefined set of relations, and s and o denote the subject and object entities, respectively. Two principal categories of models have been devised for this purpose: pipeline models and joint models. Pipeline models perform entity recognition and relation extraction sequentially, so errors made in the first step propagate to the second. In contrast, end-to-end joint extraction models have gained prominence because they exploit the implicit interactions between named entity recognition and relation extraction, thereby mitigating error propagation.

Fig. 1 Example diagram of triple extraction

Joint entity and relation extraction models can be categorized into two groups based on their decoding techniques: multi-step decoding models [4,5,6,7] and single-step decoding models [8,9,10]. Multi-step decoding models use multiple interconnected processing steps and modules to extract entities and relations sequentially. These modules share parameters but require distinct decoding algorithms at each step, so decoding errors accumulate and can degrade the model’s performance. Conversely, single-step decoding models identify entities and relations independently and combine them into triples based on their potential relevance. Because it employs a single decoding process, this approach mitigates the accumulation of decoding errors to some extent. However, it also weakens the correlation between sub-tasks. Moreover, existing joint models often use shared encoding layers, but the contextual feature information that entity extraction and relation extraction focus on is not identical [11]. Therefore, performing identification without constraints can result in computational redundancy and further reduce the model’s performance.

Therefore, we propose a novel joint entity and relation extraction approach incorporating Multi-Module Feature Information Enhancement (MFIE). We designed two distinct modules to optimize entity extraction and relation extraction. For entity extraction, we introduced a relation awareness enhancement module. This module initially extracts potential relation information from the input sentence and then integrates it with the feature data obtained from the model’s encoding layer. This enables the entity extraction model to prioritize entities relevant to the relations of interest, reducing computational redundancy. Conversely, for relation extraction, we designed an entity information enhancement module. This module leverages the outcomes of entity extraction by incorporating them as feature information. A gating mechanism combines this entity feature information with the feature data from the encoding layer, enhancing the input information for the relation extraction model and improving its performance. The effectiveness of our approach was validated through extensive experiments on two commonly used datasets: NYT and WebNLG, consistently delivering robust performance even in complex scenarios characterized by overlapping triplets and multiple triplets.

The primary contributions of this study are as follows:

  • We proposed a relation-aware enhancement module that uses the potential relation information in sentences to constrain entity extraction, so that the model is less likely to identify entities irrelevant to those relations and the computational overhead is reduced.

  • We designed an entity information enhancement module, which effectively incorporated entity information obtained from entity extraction to enrich the features extracted at the relation extraction layer, enhancing the relation extraction capability of the model.

  • To alleviate entity sparsity and relation sparsity in the text, we employ a global pointer network at the decoding layer and use a sparse cross-entropy loss function, which improves the model’s relation extraction performance.

Related work

Traditional approaches to entity relation extraction often adopt a pipeline methodology [12,13,14], breaking down the task into distinct sub-tasks of named entity recognition and relation extraction. However, this conventional method neglects to harness the inherent interactions between these two tasks, leading to a growing preference for the joint entity and relation extraction approach in current research.

Early joint extraction methods were usually categorized as multi-task learning methods. These methods essentially constructed two separate models for entity extraction and relation extraction, optimizing them uniformly by sharing parameters. The initial investigation [4] introduced sequence labeling for entity prediction and employed a tree-based LSTM (Long Short Term Memory) model for relation extraction. This innovative approach facilitated parameter sharing by utilizing shared LSTM layers. Building on this, the authors of [15] adopted a joint model of attention-based recurrent neural networks to address the shortcomings of tree structures. In [16], the authors enhanced the performance of the relation classification model by framing it as a multi-head selection problem, effectively resolving the challenge of handling multiple overlapping relations. The CasRel (Cascade Binary Tagging Framework) model [17] innovatively divided entity relation extraction into two stages: head entity recognition and relation extraction. Firstly, the head entity recognition layer was binary classified, and then relation extraction and corresponding tail entity recognition were performed based on the head entity.

While the CasRel model improved the generalization of entity relation extraction tasks, these methods were essentially pipelined methods, specifically multi-step decoding methods, and encountered issues of error propagation. Joint decoding models, which are designed based on table-filling methods [18], have introduced a fresh perspective to the joint entity and relation extraction task. In [9], the authors merged a graph neural network with a table-filling model to adeptly capture the associative information between the two subtasks. Additionally, in [10], a table-filling model was proposed, employing two independent encoders: a sequence encoder to gather information for the entity recognition task and a table encoder to acquire information for the relation extraction task. In contrast, the TPlinker model [19] took a different approach by treating joint entity-relation extraction as a Token Pair Linking problem. It introduced a specialized handshake marking scheme tailored to relations, facilitating the alignment of boundary tokens between entity pairs. The model exhibited certain advantages when dealing with multiple relations. However, it constructed matching matrices based on sentence sequences for global relations, leading to data sparsity and slower training. The EmRel (Joint Representation of Entities and Embedded Relations) model [20] introduced a relational representation and exploited rich interactions across relations, entities, and contexts. Nevertheless, it used newly initialized relations for direct embedding, resulting in noisier information and ignoring heterogeneity between entities.

Although the joint decoding algorithm alleviates the error propagation problem of pipelined models to a certain extent, it still suffers from computational redundancy when combining triples. We argue that this issue arises because the aforementioned models still split the model into modules and extract triples step by step, whereas entity extraction and relation extraction do not focus on the same contextual feature information. Thus, the shared encoding layer approach leads to feature information conflicts and an excessive number of negative samples when the model computes features.

To tackle these challenges, we designed a joint entity and relation extraction approach combined with Multi-Module Feature Information Enhancement (MFIE). We introduced a relation awareness enhancement module and an entity information enhancement module, tailored for the entity extraction and relation extraction tasks, respectively. Specifically, we first generate sentence-level relation feature vectors through an encoder and extract potential relation feature representations of sentences through a classification task. Subsequently, these are combined with the original entity feature information to perform entity recognition via a global pointer network. Afterwards, the obtained entity-extracted feature representation is combined with the feature information of the relation extraction model through a gating mechanism to obtain a new feature representation with enhanced entity information. Finally, relation extraction is performed using a global pointer network. Experimental results corroborate the efficacy of our model on public datasets.

Methodology

The overall process of the MFIE model is illustrated in Fig. 2. The algorithm consists of two main phases. In the first phase, the relation-aware enhancement module incorporates relation feature information into the input of the entity extraction model. In the second phase, the entity information enhancement module fuses the entity extraction results with the relation extraction model’s input, so that the relation extraction model receives enhanced entity feature information at its input.

Fig. 2 MFIE model

Task definition

Given a sentence \(S = \left\{ w_1, w_2, \ldots , w_n\right\} \) and a set of predefined relations \(R = \left\{ r_1, r_2, \ldots , r_m\right\} \), where n represents the sentence length and m denotes the number of relations, the joint entity and relation extraction task aims to identify all possible triples within the sentence, \(T = \left\{ (s, r, o) \mid s, o \in E,\ r \in R\right\} \). Here, s represents the head entity, o signifies the tail entity, r corresponds to the relation, and E constitutes the set of entities.

Encoder module

In the encoding layer, we employ the BERT pre-trained model as a sentence encoder to transform each token of the sentence into embeddings, as illustrated in Eq. (1).

$$\begin{aligned} D = \text {BERT}(w_1, w_2, \ldots , w_n) = (d_1, d_2, \ldots , d_n) \end{aligned}$$
(1)

Here, D represents the embedding vector acquired for each sentence, where n denotes the sequence length, and \(d_i \in \mathbb {R}^l\) signifies the embedding vector representation of each token, with l representing the dimension of the embedding vector.

Entity extraction layer based on relation awareness enhancement

In order to enhance the model’s capacity to filter out irrelevant relation information and mitigate computational redundancy during the entity extraction process, we introduce a relation awareness enhancement module to bolster the model’s entity extraction capabilities. The complete module design is depicted in Fig. 3.

Fig. 3 Structure of relation awareness enhancement module

To obtain the relation feature information associated with each word in a sentence, we initially construct a sentence relation representation vector \(V = \left\{ v_1, v_2, \ldots , v_m\right\} \). In this context, m refers to the total number of potential relations, \(v_i\) signifies the relations present within the sentence, and i serves as the relation identifier (provided in the dataset). If the sentence does not include the relation numbered i, the corresponding \(v_i\) is set to 0; if the sentence includes the relation numbered i, then \(v_i\) is set to i.
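The construction of the sentence relation representation vector can be illustrated with a minimal sketch. The function name `relation_indicator` is hypothetical, and the sketch assumes relation identifiers are numbered from 1 so that 0 can unambiguously mark an absent relation.

```python
def relation_indicator(present_ids, num_relations):
    """Build V = (v_1, ..., v_m): v_i = i if relation i occurs in the sentence, else 0.

    Assumption: relation identifiers run from 1 to m, so 0 is free to mean "absent".
    """
    present = set(present_ids)
    return [i if i in present else 0 for i in range(1, num_relations + 1)]


# A sentence annotated with relations 2 and 5, out of m = 6 predefined relations.
V = relation_indicator([2, 5], num_relations=6)
print(V)  # [0, 2, 0, 0, 5, 0]
```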

Following that, we encode the relational representation of the sentence to generate a relational embedding vector \(V_s\). Given that the number of relations in the sentence is significantly lower than the total number of relations, the embedding vector acquired may contain a noticeable level of error. To procure a more precise relation embedding vector, it is essential to input it into the self-attention mechanism for attention calculation, as delineated in Eqs. (2) and (3).

$$\begin{aligned} V_s= & {} \text {Embedding}(V) \end{aligned}$$
(2)
$$\begin{aligned} V_s= & {} \text {SelfAttention}(V_s) \end{aligned}$$
(3)
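Equations (2) and (3) can be sketched in NumPy as follows. The text does not specify the attention variant, so this sketch assumes single-head scaled dot-product self-attention; the embedding table and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical stand-ins for trainable parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_relations(V, table):
    """Eq. (2): look up one l-dimensional row per entry of V.

    table has shape (m + 1, l); row 0 is the embedding of "relation absent".
    """
    return table[np.asarray(V)]


def self_attention(X, Wq, Wk, Wv):
    """Eq. (3), assuming single-head scaled dot-product self-attention."""
    Q, K, Val = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (m, m)
    return weights @ Val                               # (m, l)


rng = np.random.default_rng(0)
m, l = 6, 8
table = rng.normal(size=(m + 1, l))
Wq, Wk, Wv = (rng.normal(size=(l, l)) for _ in range(3))

V_s = embed_relations([0, 2, 0, 0, 5, 0], table)  # Eq. (2)
V_s = self_attention(V_s, Wq, Wk, Wv)             # Eq. (3)
print(V_s.shape)  # (6, 8)
```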

The relational feature vector \(V_s \in \mathbb {R}^{m \times l}\) is a feature representation that encompasses all relations in a sentence. To combine the relation feature vector with the word vectors, we employ a sentence-level relation extraction module to process the embedding vectors obtained from the BERT model. This module yields the word-level relation prediction representation, as shown in Eq. (4).

$$\begin{aligned} P_r=\text {Relu}(W_rD+b_r) \end{aligned}$$
(4)

Here, \(W_r\) and \(b_r\) represent the trainable weights and biases, respectively. \(\text {Relu}\) denotes the activation function, and \(P_r\) signifies the acquired word-level representation of the relation prediction.

Finally, the relational feature vector is multiplied by the word-level relational prediction representation to obtain word-level relational feature information. This obtained word-level relational feature information is then added to the original BERT word vector. This process enhances the relational information in the input vectors of the final entity extraction model, thereby improving the accuracy of entity extraction, as illustrated in Eqs. (5) and (6).

$$\begin{aligned}{} & {} D_r=\text {Matmul}(P_r,V_s) \end{aligned}$$
(5)
$$\begin{aligned}{} & {} D_{ent}=D+D_r \end{aligned}$$
(6)

where \(D_r\) is the relation feature information at the word level, and \(D_{ent}=\{h_1,h_2,...,h_n\}\) is the input vector of entity extraction layer, \(h_i\) is each vector in the sequence vector. We input the input vector enhanced with relation information into the entity extraction decoding layer for entity extraction. Guided by the relation information, we can obtain more accurate entity recognition results.
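As a shape check on Eqs. (4), (5), and (6), the following NumPy sketch traces the computation. It assumes D is stored row-wise with shape (n, l), so the product written as \(W_rD\) in Eq. (4) becomes `D @ W_r` with `W_r` of shape (l, m); the function name and the random parameters are illustrative only.

```python
import numpy as np


def relation_aware_enhance(D, V_s, W_r, b_r):
    """Add word-level relation features to the BERT token vectors.

    D:   (n, l) token embeddings from the encoder
    V_s: (m, l) relation feature vectors after self-attention
    W_r: (l, m) and b_r: (m,) trainable parameters (random here)
    """
    P_r = np.maximum(D @ W_r + b_r, 0.0)  # Eq. (4): ReLU, shape (n, m)
    D_r = P_r @ V_s                       # Eq. (5): shape (n, l)
    return D + D_r                        # Eq. (6): D_ent, shape (n, l)


rng = np.random.default_rng(0)
n, m, l = 7, 6, 8
D = rng.normal(size=(n, l))
V_s = rng.normal(size=(m, l))
W_r = rng.normal(size=(l, m))
b_r = np.zeros(m)

D_ent = relation_aware_enhance(D, V_s, W_r, b_r)
print(D_ent.shape)  # (7, 8)
```

Because \(D_r\) has the same shape as D, the enhancement acts as a residual update: a token whose predicted relation scores are all zero after the ReLU keeps its original BERT vector unchanged.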

At the entity extraction decoding layer, we adopt the global pointer network decoding method, inspired by [21]. Unlike an ordinary pointer network, the global pointer network identifies the head and tail of an entity jointly as a whole, as shown in Fig. 4 and Eqs. (7, 8, 9).

$$\begin{aligned}{} & {} q_{i,a}=W_{q,a}h_i+b_{q,a} \end{aligned}$$
(7)
$$\begin{aligned}{} & {} k_{i,a}=W_{k,a}h_i+b_{k,a} \end{aligned}$$
(8)
$$\begin{aligned}{} & {} E_a(i,j)=q_{i,a}^{T}O_i^{T}O_jk_{j,a} \end{aligned}$$
(9)

Here, W and b denote trainable parameters. \(q_{i,a}\in \mathbb {R}^l\) and \(k_{i,a}\in \mathbb {R}^l\) are the vector representations used to identify entities of type a: \(q_{i,a}\) represents the head token and \(k_{i,a}\) the tail token of the entity, and l denotes the vector dimension. The indices i and j denote the start and end positions of the entity in the sequence, respectively, and \(O_i\) and \(O_j\) encode the relative position information of the tokens in the sequence. \(E_a \in \mathbb {R}^{a \times n \times n}\) denotes the score matrix of the entities, where a is the number of types and n is the length of the sequence, and \(E_a(i,j)\) is the entry at position (i, j) of the matrix, that is, the score of the span from token i to token j as an entity of type a. Since entity extraction in the entity relation extraction task does not need to identify specific entity types, but only needs to classify entities as head entities or tail entities, we treat entity classification as a binary task with \(a=2\).
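The scoring in Eqs. (7), (8), and (9) can be sketched as follows. The text only states that O carries relative position information; this sketch assumes a rotary position encoding for it, which is one common choice and has the property that \(O_i^{T}O_j\) depends only on \(j-i\). The helper names, the projection size `d`, and the random parameters are hypothetical.

```python
import numpy as np


def rope(x):
    """Rotate each consecutive feature pair of row p by an angle proportional to p.

    x has shape (n, d) with d even. This plays the role of O_p applied to row p.
    """
    n, d = x.shape
    pos = np.arange(n)[:, None]                      # (n, 1)
    inv_freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d / 2,)
    ang = pos * inv_freq                             # (n, d / 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def global_pointer_scores(H, Wq, bq, Wk, bk):
    """Return E of shape (a, n, n), where E[t, i, j] scores span (i, j) for type t.

    H: (n, l) input vectors h_1..h_n; Wq, Wk: (a, l, d); bq, bk: (a, d).
    """
    a, n = Wq.shape[0], H.shape[0]
    E = np.empty((a, n, n))
    for t in range(a):
        q = rope(H @ Wq[t] + bq[t])  # Eq. (7), followed by O_i
        k = rope(H @ Wk[t] + bk[t])  # Eq. (8), followed by O_j
        E[t] = q @ k.T               # Eq. (9)
    return E


rng = np.random.default_rng(0)
n, l, d, a = 5, 6, 4, 2  # a = 2: head entities and tail entities
H = rng.normal(size=(n, l))
Wq, Wk = rng.normal(size=(a, l, d)), rng.normal(size=(a, l, d))
bq, bk = np.zeros((a, d)), np.zeros((a, d))

E = global_pointer_scores(H, Wq, bq, Wk, bk)
print(E.shape)  # (2, 5, 5)
```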

Fig. 4 Head and tail entity extraction based on global pointer network

Relation extraction layer based on entity information enhancement

The structural diagram of the entity information enhancement module is depicted in Fig. 5.

Fig. 5 Entity information enhancement module structure diagram

Specifically, the output of the entity extraction layer is a score matrix for all entities, \(E_a\in \mathbb {R}^{a \times n \times n}\), where \(a=1\) corresponds to the head entity and \(a=2\) to the tail entity. We need to split this matrix to obtain the head token and tail token of the head entity, and the head token and tail token of the tail entity.

First, the matrix is split into a head entity matrix and a tail entity matrix. This is shown in Eq. (10).

$$\begin{aligned} E_1,E_2=\text {chunk}(E_a) \end{aligned}$$
(10)

where \(E_1\in \mathbb {R}^{n \times n}\) is the head entity matrix, \(E_2\in \mathbb {R}^{n \times n}\) is the tail entity matrix, and \(\text {chunk}\) is the matrix split function, applied here along dimension a.

Due to the nature of the global pointer network, we can consider each \(n*n\) matrix as representing the positional relations among words in a sentence sequence of length n. Therefore, we can treat the words along the row coordinates as head tokens and the words along the column coordinates as tail tokens. Subsequently, we will reduce the dimensions of the matrix through aggregation. Specifically, we will reduce matrices \(E_1\) and \(E_2\) row-wise into vectors \(D_{h,s}\) and \(D_{h,e}\), and column-wise into vectors \(D_{t,s}\) and \(D_{t,e}\). In this context, \(D_{h,s}\) and \(D_{t,s}\) denote the head and tail tokens of the head entity, respectively, while \(D_{h,e}\) and \(D_{t,e}\) represent the head and tail tokens of the tail entity.

The obtained \(D_{h,s}\), \(D_{h,e}\) and \(D_{t,s}\), \(D_{t,e}\) are then combined to obtain the head token pair vector and the tail token pair vector, as shown in Eqs. (11, 12).

$$\begin{aligned}{} & {} D_h=D_{h,s}+D_{h,e} \end{aligned}$$
(11)
$$\begin{aligned}{} & {} D_t=D_{t,s}+D_{t,e} \end{aligned}$$
(12)

Here, \(D_h\) denotes the vector representation of the head token pair, while \(D_t\) denotes the vector representation of the tail token pair. Then, using a gating mechanism with \(D_h\) and \(D_t\) as the gates, the original BERT word vector representation is enhanced with entity information, as shown in Eqs. (13, 14). The input of the relation extraction model thus includes entity information obtained from entity extraction, which improves the performance of relation extraction.

$$\begin{aligned}{} & {} D_{relh}=D+D*\text {sigmoid}(D_h) \end{aligned}$$
(13)
$$\begin{aligned}{} & {} D_{relt}=D+D*\text {sigmoid}(D_t) \end{aligned}$$
(14)

where D is the original BERT word vector representation, and \(D_{relh}\) and \(D_{relt}\) are the entity-enhanced head token pair and tail token pair vector representations, respectively, which are input to the relation extraction model.
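Equations (10) to (14) can be traced with the short NumPy sketch below. Two details are not fixed by the text and are assumptions here: the aggregation used to reduce each \(n \times n\) matrix is taken to be a sum, and the length-n gate vectors are broadcast across the hidden dimension of D. The function name is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def entity_info_enhance(D, E_a):
    """Gate the BERT vectors D (n, l) with entity scores E_a (2, n, n)."""
    E_1, E_2 = E_a[0], E_a[1]           # Eq. (10): split along the type dimension
    # Row-wise reduction: one value per head (start) token. Sum is an assumption.
    D_hs, D_he = E_1.sum(axis=1), E_2.sum(axis=1)
    # Column-wise reduction: one value per tail (end) token.
    D_ts, D_te = E_1.sum(axis=0), E_2.sum(axis=0)
    D_h = D_hs + D_he                   # Eq. (11), shape (n,)
    D_t = D_ts + D_te                   # Eq. (12), shape (n,)
    # Broadcasting the gate over the hidden dimension is an assumption.
    D_relh = D + D * sigmoid(D_h)[:, None]  # Eq. (13)
    D_relt = D + D * sigmoid(D_t)[:, None]  # Eq. (14)
    return D_relh, D_relt


rng = np.random.default_rng(0)
n, l = 5, 8
D = rng.normal(size=(n, l))
E_a = rng.normal(size=(2, n, n))

D_relh, D_relt = entity_info_enhance(D, E_a)
print(D_relh.shape, D_relt.shape)  # (5, 8) (5, 8)
```

Since the sigmoid lies strictly between 0 and 1, each token vector is scaled by a factor between 1 and 2: tokens with high entity scores are amplified, while the residual term D keeps the remaining tokens from being suppressed.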

In the relation extraction decoding layer, we continue to employ the global pointer network for decoding, but with a different configuration. In this case, the number of types is set to \(a=m\), where m is the total number of relations in the dataset. The head token pairs and tail token pairs are separately fed into the global pointer network for decoding, resulting in the score matrices \(E_h\in \mathbb {R}^{a \times n \times n}\) and \(E_t\in \mathbb {R}^{a \times n \times n}\).

Training strategies

Following prior research [22], we use a sparse variant of the multi-label cross-entropy loss function for model training. The traditional multi-label cross-entropy loss function is computed by enumerating the positive and negative samples separately, and its general structure is shown in Eq. (15). Specifically, the prediction errors of the positive class T and the negative class F are calculated and summed to obtain the final loss value. A positive sample is one that belongs to the class, and a negative sample is one that does not, as shown in Fig. 4, where 1 corresponds to a positive sample and 0 corresponds to a negative sample.

$$\begin{aligned} \text {loss} = \log \left( 1 + \sum _{i \in T} e^{-D_i}\right) +\log \left( 1 + \sum _{i \in F} e^{D_i}\right) \end{aligned}$$
(15)

where T denotes the set of positive classes, F denotes the set of negative classes, and \(D_i\) denotes the i-th score in the score matrices obtained previously.
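For reference, Eq. (15) can be computed directly from a score array and a 0/1 label array of the same shape. This is a plain sketch for small inputs with a hypothetical function name; it does not include the numerical safeguards (such as a log-sum-exp formulation) that a training implementation would normally use.

```python
import numpy as np


def multilabel_ce(scores, labels):
    """Eq. (15): log(1 + sum_{i in T} exp(-D_i)) + log(1 + sum_{i in F} exp(D_i)).

    scores: array of any shape; labels: same shape, 1 = positive, 0 = negative.
    """
    s = np.asarray(scores, dtype=float).ravel()
    y = np.asarray(labels).ravel().astype(bool)
    pos_term = np.log1p(np.exp(-s[y]).sum())   # positives should score high
    neg_term = np.log1p(np.exp(s[~y]).sum())   # negatives should score low
    return pos_term + neg_term


scores = np.array([[4.0, -3.0], [-5.0, 2.0]])
labels = np.array([[1, 0], [0, 1]])
print(multilabel_ce(scores, labels))
```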

Since the number of negative samples is much larger than that of positive samples, traversing the negative samples incurs a large computational burden, so we chose the sparse multi-label cross-entropy loss function. The sparse version of the multi-label cross-entropy is calculated by substituting the whole sample set for the negative samples, as shown in Eq. (16). Specifically, the whole sample set comprises both positive and negative samples. If the loss were calculated directly on the positive and negative samples, it would require an index-based judgment for each sample. Calculating across the whole sample set bypasses this indexing phase and allows direct computation on the entire prediction matrix. Therefore, replacing the more numerous negative samples with the whole sample set can significantly reduce the computational burden.

$$\begin{aligned} \log \left( 1 + \sum _{i \in F} e^{D_i}\right) = \log \left( 1 + \sum _{i \in A} e^{D_i}-\sum _{i \in T} e^{D_i}\right) \end{aligned}$$
(16)

where \(A = T \cup F\), thus the loss values for the set of negative classes can be calculated from A and T, greatly reducing the amount of computation. Ultimately, based on the requirements of the entity-relation extraction task, the calculated loss values are derived by calculating and subsequently averaging the score matrices for all three: entity extraction, head token-to-relation extraction and tail token-to-relation extraction, as shown in Eq. (17).

$$\begin{aligned} \text {loss}=\left( \text {loss}\left( E_a\right) +\text {loss}\left( E_h\right) +\text {loss}\left( E_t\right) \right) /3 \end{aligned}$$
(17)

where \(E_a\) is the score matrix for entity extraction, \(E_h\) is the score matrix for head token-to-relation extraction, and \(E_t\) is the score matrix for tail token-to-relation extraction.
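The identity in Eq. (16) and the averaging in Eq. (17) can be sketched as follows. The sparse variant only needs the flat indices of the positive entries, so the negative set is never enumerated. Function names are hypothetical, and the direct subtraction of exponential sums is kept for readability even though it is not numerically robust for large scores.

```python
import numpy as np


def sparse_multilabel_ce(scores, pos_idx):
    """Multi-label cross-entropy using only the positive indices.

    scores:  score array of any shape (the set A is every entry)
    pos_idx: flat indices of the positive entries (the set T)
    """
    s = np.asarray(scores, dtype=float).ravel()
    pos = s[np.asarray(pos_idx, dtype=int)]
    sum_all = np.exp(s).sum()                 # sum over A
    sum_pos = np.exp(pos).sum()               # sum over T
    neg_term = np.log1p(sum_all - sum_pos)    # Eq. (16)
    pos_term = np.log1p(np.exp(-pos).sum())   # positive part of Eq. (15)
    return pos_term + neg_term


def total_loss(loss_ent, loss_head, loss_tail):
    """Eq. (17): average of the three task losses."""
    return (loss_ent + loss_head + loss_tail) / 3.0


rng = np.random.default_rng(0)
E = rng.normal(size=(4, 4))
pos_idx = [1, 6]

# Dense reference value from Eq. (15), enumerating the negatives explicitly.
mask = np.zeros(E.size, dtype=bool)
mask[pos_idx] = True
flat = E.ravel()
dense = np.log1p(np.exp(-flat[mask]).sum()) + np.log1p(np.exp(flat[~mask]).sum())

print(np.isclose(sparse_multilabel_ce(E, pos_idx), dense))  # True
```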

Experimental results and analysis

Dataset

We performed experiments on two publicly accessible datasets: NYT and WebNLG. The aim was to assess the efficacy of our MFIE model by comparing it with classical models. Table 1 provides specific dataset details. In Table 2, “Normal” indicates datasets without overlapping triples, “SEO” signifies scenarios with only one shared entity among triples, “EPO” designates datasets where triples share the same entity pair, and “N” represents the number of triples.
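The three overlap categories can be made concrete with a small sketch that labels the triples of one sentence. It is an illustration of the definitions above, not the procedure used to build Table 2: it assumes that an entity pair counts as shared regardless of its order, and that a sentence may receive both the SEO and EPO labels. The function name is hypothetical.

```python
from itertools import combinations


def overlap_types(triples):
    """Return the overlap labels for the (s, r, o) triples of one sentence."""
    labels = set()
    for (s1, _, o1), (s2, _, o2) in combinations(set(triples), 2):
        if {s1, o1} == {s2, o2}:
            labels.add("EPO")      # same entity pair, different relation or direction
        elif {s1, o1} & {s2, o2}:
            labels.add("SEO")      # exactly one entity in common
    return labels or {"Normal"}


print(overlap_types([("A", "r1", "B")]))                     # {'Normal'}
print(overlap_types([("A", "r1", "B"), ("A", "r2", "C")]))   # {'SEO'}
print(overlap_types([("A", "r1", "B"), ("B", "r2", "A")]))   # {'EPO'}
```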

Table 1 Dataset specific information
Table 2 Test set specific information

Assessment of indicators

For an equitable comparison with prior research, we employ Precision, Recall, and F1 scores as evaluation metrics for our model. The formulas are shown in Eqs. (18, 19, 20).

$$\begin{aligned}{} & {} \text {Precision}=\text {TP}/(\text {TP}+\text {FP}) \end{aligned}$$
(18)
$$\begin{aligned}{} & {} \text {Recall}=\text {TP}/(\text {TP}+\text {FN}) \end{aligned}$$
(19)
$$\begin{aligned}{} & {} \text {F1}=2*\text {Precision}*\text {Recall}/(\text {Precision}+\text {Recall}) \end{aligned}$$
(20)

where TP is the number of entities correctly identified, FP is the number of non-entities incorrectly identified as entities, and FN is the number of entities incorrectly identified as non-entities.
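Equations (18), (19), and (20) can be computed from sets of predicted and gold items, as in the sketch below. It assumes exact matching, so a predicted item counts as a true positive only if it appears verbatim in the gold set; the function name and the example triples are illustrative.

```python
def prf1(predicted, gold):
    """Return (precision, recall, f1) under exact matching."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # correctly identified
    fp = len(predicted - gold)   # predicted but not in the gold set
    fn = len(gold - predicted)   # in the gold set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0                     # Eq. (18)
    recall = tp / (tp + fn) if tp + fn else 0.0                        # Eq. (19)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                              # Eq. (20)
    return precision, recall, f1


gold = [("Paris", "capital_of", "France"), ("Seine", "flows_through", "Paris")]
pred = [("Paris", "capital_of", "France"), ("Seine", "capital_of", "France")]
print(prf1(pred, gold))  # (0.5, 0.5, 0.5)
```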

Experimental setup

In our experiments, we trained the model on an RTX 4090 GPU running on the Windows 10 OS. We employed the BERT-Base-Cased version for our pre-trained model, featuring 12 transformer layers with a hidden layer size of 768 and 12 self-attention heads. The hyper-parameters of the model are specified in Table 3.

Table 3 Experimental parameters

Baseline model

This model will be compared with the following baseline models:

  1. The CasRel model [17] first extracts information about the head entity and subsequently extracts information about the tail entity and entity relations using the head entity information as a criterion.

  2. The TPlinker model [19] approaches the joint extraction of entity relations as a token-pair linking problem and incorporates a handshake marking scheme to align entity pairs.

  3. The EmRel model [20] introduces a relational representation and exploits rich interactions across relations, entities and contexts to enhance the learning model.

  4. The PRGC model [23] presents a joint extraction framework that emphasizes latent connections and worldwide correspondences.

  5. The SPN4RE model [24] proposes an integrated prediction model.

  6. The OneRel model [25] employs a solitary module and a single-stage approach to extract triplets directly from text.

  7. The PMEI model [26] introduces an incremental multi-task learning strategy that leverages early predicted information interactions to enhance representations specific to each task.

  8. The PARE model [27] proposes a remote supervision based learning model.

  9. The ERGM model [28] suggests a unified extraction approach founded on a global entity matching strategy, incorporating a relational attention mechanism for embedding relational representations, similar to EmRel.

  10. The PRDCEM model [29] utilizes a cross-attention mechanism to detect relational information to obtain relational information embedding, and a negative sampling strategy to reduce error propagation as a way to improve the model’s relational extraction performance.

  11. The TERS model [30] introduces a sequence of relations to connect the two triple extraction steps, filters out irrelevant information, and uses iterations to interact with the information.

Analysis of experimental results

To assess the effectiveness of the MFIE model, we carried out experiments on both the NYT and WebNLG datasets. Figure 6 shows how the F1 score changes over the first 20 training epochs, with the red curve representing the NYT dataset and the green curve representing the WebNLG dataset. The curves show that training converges rapidly overall, with the model reaching a fitted state in just 15 epochs. Compared with the NYT dataset, the model fits more slowly on the WebNLG dataset. We attribute this to the WebNLG dataset having a smaller amount of data but a larger number of relations, which places a higher learning burden on the relation-aware enhancement module of our MFIE model.

Fig. 6 Curve of F1 change in model training

Comparison experiment

To evaluate the efficacy of our model, we performed a comparative analysis with the baseline on two separate datasets, as depicted in Table  4.

Table 4 Comparative experimental results

Table  4 presents a comparison between our model’s experimental outcomes and those of other baseline models on both the NYT and WebNLG datasets. The superior results are highlighted in bold, while the second-best ones are underscored. On the NYT dataset, our model excels, demonstrating a remarkable 0.7 increase in the final F1 value compared to the runner-up. Conversely, with the WebNLG dataset, our model performs comparably to the top baseline model. Still, it secures the second-best position by elevating the F1 value when contrasted with the other baseline models. This incremental enhancement on the WebNLG dataset is primarily due to its relatively smaller dataset size, yet an abundance of relations, which places constraints on our relation-aware enhancement module’s learning capacity. Despite our endeavors to augment the module’s aptitude for learning sentence relations through the integration of the relation attention mechanism, the advancement remains somewhat limited. In contrast, the NYT dataset provides an ample volume of training data and features a significantly smaller number of relations in comparison to the WebNLG dataset. Consequently, the relation-aware enhancement module has more extensive opportunities for proficient learning, ultimately leading to a superior overall performance. Thus, our forthcoming research will be dedicated to exploring more effective methods for extracting relational information from sentences.

Efficiency analysis

In this section, we compare the efficiency of our model with that of OneRel, the best-performing baseline above, on the NYT dataset, as shown in Table 5. The batch size is uniformly set to 6 and the maximum length to 512, and for convenience the comparison is made on the first epoch. In the table, Tt denotes the training time, Dt denotes the validation time, and Sum denotes the memory usage during training.

Table 5 Results of efficiency comparison experiments

As can be seen in Table 5, the training time of our model on the first epoch is 55 s, which is 10 s less than that of the OneRel model. The validation time of our model on the validation set is 4 s, which is 3 s less than that of the OneRel model. In terms of memory usage, our model uses 5592 MB less than the OneRel model for the same batch size. These results show that our model is more computationally efficient and more competitive in real-world application environments. We attribute this to our model decoding entity extraction and relation extraction separately, which reduces the dimensionality of the score matrices to be computed. The OneRel model, although better in terms of effectiveness, uses single-step decoding, so its score matrix contains both entity and relation components and therefore has a higher dimensionality and a greater computational burden.

Detailed results for complex scenarios

To confirm our model’s capability to handle sentences with overlapping and multiple triples, we conducted extension experiments inspired by CasRel. These experiments were performed using the NYT and WebNLG datasets, and we selected the same four models as baselines. Tables 6 and 7 display the detailed experimental comparisons. Bold text highlights the best results obtained.

The results reveal that our model achieves the highest F1 scores in 13 out of 16 subsets. This demonstrates our model’s distinct advantage in handling cases involving straightforward overlapping and multiple triples.

Table 6 Experimental results in overlapping ternary group mode
Table 7 Experimental results in multi-triad mode

Ablation experiment

To further affirm the efficiency of our model and assess the influence of the two modules on its performance, we conducted ablation experiments. Table  8 presents the F1 score results from these ablation experiments on the overall model. We conducted comparative experiments by removing modules, and the models in the table are the original model (Ours), the model with the entity information enhancement module (EIE) removed, the model with the relation awareness enhancement module (RAE) removed, and finally the model with both modules removed, only retaining the embedding layer and encoding layer.

From Table  8, it can be observed that both the relation awareness enhancement module and the entity information enhancement module have a significant impact on the model. However, the enhancement of the entity information enhancement module is less for the model compared to the relation awareness enhancement module. We suggest that this is due to error accumulation. Although the module in this paper facilitates the interaction between subtasks to a certain extent, it also increases the error accumulation of the whole model accordingly, thus limiting the enhancement effect of the module. The improvement in the relation awareness enhancement module is less pronounced on the WebNLG dataset compared to the NYT dataset. This observation aligns with our earlier argument that the WebNLG dataset has a larger number of relations but a smaller dataset size, which limits the module’s learning capacity and consequently results in a smaller enhancement effect. In future research, we will explore methods to leverage entity extraction results more effectively to enhance relation extraction, addressing the issue of error accumulation in the entire model. Additionally, we will investigate techniques to enhance the learning ability of the relation-aware module when working with smaller datasets.

To further verify the contribution of the relation awareness enhancement module, we first remove the entity information enhancement module and retain the relation awareness enhancement module, treating the resulting model as the reference model “R”. We then conduct ablation experiments on the WebNLG dataset. As shown in Fig. 7, “R-w/o attention” denotes the model without the attention mechanism, and “R-w/o RAE” denotes the model without the entire relation awareness enhancement (RAE) module. Here, s denotes the F1 score of head entity extraction, o the F1 score of tail entity extraction, (s, o) the F1 score of entity pair extraction, and (s, r, o) the F1 score of complete triple extraction. The results indicate that the relation awareness enhancement module improves the entity extraction capability of the model, and that the attention mechanism in the module further improves both entity extraction and sentence-level relation extraction. As the figure shows, better entity extraction in turn improves the model’s triple extraction performance.
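
The four scores reported in Fig. 7 can be read as one F1 computation applied to different projections of a triple. The following is a minimal sketch of that reading, not the evaluation script used in this paper: the function names, the scoring of a single sentence rather than a whole test set, and the placeholder triples (e1, r1, and so on) are all illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def f1_score(gold: Set[Hashable], pred: Set[Hashable]) -> float:
    """F1 between a set of gold items and a set of predicted items."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def granular_f1(gold: Set[Triple], pred: Set[Triple]) -> Dict[str, float]:
    """Score the same predictions at the four granularities used in Fig. 7."""
    views: Dict[str, Callable[[Triple], Hashable]] = {
        "s": lambda t: t[0],                # head entity only
        "o": lambda t: t[2],                # tail entity only
        "(s, o)": lambda t: (t[0], t[2]),   # entity pair, relation ignored
        "(s, r, o)": lambda t: t,           # complete triple
    }
    return {
        name: f1_score({view(t) for t in gold}, {view(t) for t in pred})
        for name, view in views.items()
    }


# Hypothetical placeholder triples, not taken from NYT or WebNLG.
gold = {("e1", "r1", "e2"), ("e2", "r2", "e1")}
pred = {("e1", "r1", "e2"), ("e1", "r2", "e2")}
scores = granular_f1(gold, pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the second predicted triple has the right relation but reversed arguments, so the complete-triple score is lower than the entity-level scores. This is the kind of gap the separate s, o, (s, o), and (s, r, o) columns are meant to expose.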

Table 8 F1 scores of the ablation experiments
Fig. 7 Results of ablation experiments on the WebNLG dataset regarding the relation-aware module

Impact of different pre-trained models

To verify the scalability of our method, we selected four different pre-trained models, BERT-small, BERT-base, BERT-large, and RoBERTa-base, and experimented on both the NYT and WebNLG datasets. As shown in Table 9, with the different pre-trained models our method achieves an F1 score above 90% on both datasets, which indicates that it scales well across encoders.

Table 9 F1 scores with different pre-trained models

Case studies

In this section, we analyze our model on specific examples from the NYT and WebNLG datasets, as depicted in Fig. 8. In the figure, green marks all the triples in the sentence, red marks incorrectly recognized triples, and blue marks correctly recognized triples. We compare the recognition results under four conditions: “Ours”, “w/o EIE”, “w/o RAE”, and “w/o RAE and EIE”.

Fig. 8 Case studies

According to Fig. 8, the “w/o RAE” model lacks the relation awareness enhancement module and therefore cannot filter out entities that take part in no relation in the sentence during entity extraction. As a result, it extracts wrong entities paired with wrong relations. For example, in the first sentence, the relation “/people/deceased_person/place_of_death” does not exist in the original sentence; because irrelevant relations cannot be filtered in advance, the model still extracts this relation and the corresponding entities. The “w/o EIE” model lacks the entity information enhancement module and therefore fails to identify the correct entities when extracting relations. For example, in the second sentence, the triple [“Memorial”, “leaderTitle”, “Azerbaijan”] is incorrectly extracted, where “Memorial” is the wrong entity.

The “w/o RAE and EIE” model lacks interaction between the subtasks once both modules are removed, so the errors of the entity extraction subtask accumulate with those of the relation extraction subtask. Consequently, the model extracts triples poorly and incompletely. In the first sentence, it recognizes only a single relation type for a given entity pair and has difficulty extracting multiple relations for the same pair. In the second sentence, without the support of the two modules for their respective subtask layers, extraction is again incomplete and erroneous triples are not filtered out. Overall, these cases suggest that the two proposed modules strengthen the interaction between the subtasks and improve the model’s triple extraction ability, yielding the best extraction results among the compared variants.
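
The filtering behaviour discussed in this case study can be illustrated with a small sketch. It is a deliberate simplification and not the implementation of the relation awareness enhancement module, which feeds relation information into the entity extraction layer rather than discarding triples after the fact. The function names, the 0.5 threshold, the probability values, and the placeholder entities PERSON_X and CITY_Y are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def potential_relations(relation_probs: Dict[str, float],
                        threshold: float = 0.5) -> Set[str]:
    """Keep the relations whose sentence-level probability reaches the threshold."""
    return {rel for rel, prob in relation_probs.items() if prob >= threshold}


def filter_triples(candidates: List[Triple],
                   relation_probs: Dict[str, float],
                   threshold: float = 0.5) -> List[Triple]:
    """Drop candidate triples whose relation was not judged present in the sentence."""
    kept = potential_relations(relation_probs, threshold)
    return [triple for triple in candidates if triple[1] in kept]


# Hypothetical sentence-level relation probabilities (illustrative values only).
relation_probs = {
    "/people/deceased_person/place_of_death": 0.08,
    "/people/person/place_lived": 0.91,
}
# Hypothetical candidates as a model without relation filtering might emit them.
candidates = [
    ("PERSON_X", "/people/deceased_person/place_of_death", "CITY_Y"),
    ("PERSON_X", "/people/person/place_lived", "CITY_Y"),
]
filtered = filter_triples(candidates, relation_probs)
print(filtered)
```

Without the filter both candidates would be emitted, which mirrors the spurious “/people/deceased_person/place_of_death” triple in the first sentence of Fig. 8.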

Conclusion

This paper presents a joint entity and relation extraction method based on multi-module feature information enhancement (MFIE). First, we use a pre-trained BERT encoder to obtain word embedding vectors from the text. We then incorporate two specialized modules that enhance entity extraction and relation extraction, respectively: the relation awareness enhancement module and the entity information enhancement module. The relation awareness enhancement module captures potential relation information from sentences through a potential relation extraction module and an attention mechanism. It then integrates this information with the BERT encoding so that the input of the entity extraction layer includes relation information while containing less irrelevant content. The entity information enhancement module combines the entity extraction results with the BERT encoding via a gating mechanism. This enriches the input of the relation extraction layer with entity information and thereby improves relation extraction performance. In the decoding layer, we employ a global pointer network and sparse multi-label cross-entropy to decode features and train the model, which yields the best triple extraction results in our experiments. Comparative and ablation experiments on the NYT and WebNLG datasets validate the effectiveness of our MFIE model.

Nonetheless, in scenarios with a larger number of relations and fewer training samples, our model shows only limited improvement over the baseline model. In future research, we will further optimize the synergy between entity and relation information so that the model can identify intricate entity and relation categories more efficiently.