Introduction

Information extraction is a fundamental task in natural language processing (NLP), and relation extraction is one of its important sub-tasks [1, 2]. Early work extracts the relation between two entities within a single sentence, and this line of work has been successful [3,4,5]. In recent years, the more challenging document-level relation extraction (DRE) has attracted growing attention, and more work has started to focus on this task [6]. DRE plays a crucial role in acquiring knowledge from unstructured documents and supports many downstream NLP applications such as knowledge graph construction, recommendation systems, and semantic search (Fig. 1).

Fig. 1 An example of DocRED

Many previous document-level relation extraction methods use dependencies and basic rules to generate document graphs and then use graph neural networks for inference, such as EoG [7], GAIN [8], DRN [9], and LSR [10]. However, these approaches have drawbacks. For example, the graph structures they exploit do not extract semantic information adequately. Some works try to solve the task directly with language models such as the transformer [11, 12]. These works usually treat the mentions of different entities equally and fuse the features of the same entity in an overly simple way. A few works focus on the distinction between positive and negative samples but not on the problem of category imbalance. Finally, very little work currently considers lexical confusion, which is a crucial issue in semantic-based NLP tasks.

Document-level relation extraction differs from sentence-level relation extraction in three main ways. (1) Entities are more numerous and are distributed across long texts, and the same entity may appear in different sentences. We therefore need more efficient ways to model the entities whose relations must be determined. (2) Long texts make it harder to reason about the relations between entities. For example, some relations can only be inferred by reasoning across five or more sentences. We therefore need more precise reasoning to infer the corresponding relation labels. (3) The data distribution is very uneven. For example, each document in DocRED [6] is annotated with named entity mentions, co-reference information, intra- and inter-sentence relations, and supporting evidence, yet about 10% of the relation types account for nearly 60% of the total samples. Alleviating this uneven sample distribution is key to improving performance on this task.

To solve the above three problems, we propose a novel document-level relation extraction method, dubbed SKAMRR (sememe knowledge-enhanced abstract meaning representation and reasoning), as illustrated in Fig. 2. The model builds on a pre-trained language model, constructs document-level AMR graphs, and models effective semantic associations. The model also constructs a document-level entity graph to reason about the relations that hold for each entity pair. Furthermore, we devise a new loss function to mitigate the uneven distribution of data in the dataset. Intuitively, the document-level AMR graph is a core information extraction graph in which document nodes, sentence nodes, AMR nodes, and entity mention nodes are connected by different types of edges to model clustering information. Relation inference operates on the entity mention graph (whose entity mention nodes are obtained from the document-level AMR graph), a semantically active graph designed to model the relations that exist between different entity pairs.

Fig. 2 The overall architecture of SKAMRR. First, the input document is encoded by BERT. Then, the sememe knowledge-enhanced AMR graph generates the head and tail entity representations. Next, we construct an entity-pair graph and use GIN to model the graph interaction. Finally, the classifier predicts the relations of all entity pairs, and the model loss is calculated with the proposed GAL

First, we fuse sememe information into the word representations, including those of entity mentions, which alleviates the problem of lexical confusion. We then generate sentence-level abstract meaning representation (AMR) graphs for the sentences in a document and combine them into a document-level AMR graph according to a set of rules. Document-level AMR graphs are rooted, annotated, directed, and acyclic graphs that represent high-level semantic relationships between the abstract concepts of unstructured natural language text. Because AMR is a high-level semantic abstraction, different sentences that are semantically similar may share the same AMR parse, which also filters out, to some degree, information that the model does not need.

Second, we build entity-pair graphs and use graph neural networks (GNN) to obtain vector representations of the nodes in the graph. Specifically, we obtain the representations of entity mention nodes from the document-level AMR graph and construct the entity-pair graph, which is the core inference graph of this method. When constructing the inference graph, we connect entity-pair nodes that share an entity, which yields a document-level inference graph that supports multi-hop inference. Then, SKAMRR uses a graph isomorphism network (GIN) [13] to obtain entity representations enriched with contextual feature information, which facilitates capturing long-range relation information.

Finally, we design a global adaptive loss function to address the long-tail data problem. The uneven distribution of data in document-level relation extraction datasets is particularly apparent in DocRED, the dataset most often used for this task.

The main contributions of this paper are summarized as follows:

  • We construct a document-level heterogeneous AMR graph. This graph structure models the abstract semantic information at the document level well. Because semantically similar sentences share AMR structure, it also helps the model filter out repetitive information.

  • We introduce sememe knowledge in the document-level graph to improve the lexical representation of words at a finer level by merging different lexical senses and sememe information.

  • We design a novel loss function named global adaptive loss (GAL). This function mitigates the impact of the long-tail effect in the dataset on model performance and improves the generalization ability of the model.

  • SKAMRR outperforms the baseline models on four document-level relation extraction datasets (DocRED, CDR, GDA, and HacRED). Our experimental results demonstrate the efficacy of our method.

Organization. The rest of this paper is organized as follows. In the section “Related Work”, we introduce the related work including document-level relation extraction, graph neural networks and abstract meaning representation. In the section “Our Method”, we describe our proposed method in detail. The section “Experiments” gives the experimental setup and results. In the section “Conclusion”, we conclude the entire paper.

Related work

Document-level relation extraction

Document-level relation extraction methods can generally be classified into two categories: document graph-based and sequence-based approaches. Graph-based approaches mainly use words or entities as graph nodes, construct the document graph by learning the latent graph structure of the document, and then perform inference with graph neural networks. Reference [14] is the first to construct document graphs for relation extraction across sentences. Reference [7] constructs heterogeneous graphs with three kinds of nodes and five kinds of edges. Reference [10] applies the matrix-tree principle to heterogeneous networks, induces the graph structure through attention interactions, and iteratively updates the induced structure. Reference [15] performs DRE by learning a pronoun–mention graph representation, in which the derived graph models the relations among pronouns and mentions. References [8, 9, 16,17,18,19,20,21] all predict relations by constructing document graphs and devising a way to reason over the graph representations.

The other class of methods is mainly sequence-based [19, 22]. As the transformer has become widely used in NLP in recent years, more relation extraction methods have been built on it. Since the sequence-based transformer can model long-range sequences, such methods do not introduce a graph structure. Reference [11] incorporates structural dependencies into the encoder network and can perform both context reasoning and structure reasoning. Reference [23] introduces a localized context pooling technique to solve the problem of using the same entity embedding for all entity pairs and proposes an adaptive thresholding loss for long-tail data. Reference [24] proposes an entity knowledge injection framework that enhances the DRE task by introducing co-reference distillation and representation reconciliation. Reference [25] proposes a densely connected criss-cross attention network, which collects contextual information in the horizontal and vertical directions of the entity-pair matrix to enhance the corresponding entity-pair representation. Reference [26] builds upon co-reference resolution and gathers relevant signals via multi-instance learning. There has also been some recent work based on contrastive learning, which focuses on the long-tail and data noise issues in datasets [27, 28].

In this paper, we use a graph structure-based approach, because graph-structured data has a natural advantage for inference, both in modeling documents accurately and in capturing semantic relationships between long-distance entities in more detail. To the best of our knowledge, we are the first to apply the AMR graph to the task of DRE.

Graph neural networks

Graph neural networks (GNNs) have attracted increasing attention recently. While traditional neural networks are better suited to data in Euclidean space, GNNs apply neural networks to graph-structured data. There are many types of GNNs, including graph convolutional networks (GCN) [29], graph attention networks (GAT) [30], and graph isomorphism networks (GIN) [13]. GNNs can also be applied to unstructured data in which the graph structure is latent, including tasks in computer vision and natural language processing. In recent years, many works in NLP have applied GNN techniques to tasks such as text classification [31, 32], question answering [33, 34], text generation [35, 36], and abnormal text detection [37, 38]. In this paper, we employ three kinds of GNN for document-level relation extraction, using them to fuse sememe information and to perform inference over relational entity pairs.

HowNet and abstract meaning representation

HowNet is one of the most famous sememe knowledge bases and has been constructed over more than 20 years. A sememe is the minimum semantic unit in linguistics, and some linguists hold that the meanings of all words in a language can be represented by a limited set of sememes [39]. HowNet contains many Chinese and English words together with their senses and sememe information. The sememes of the senses in HowNet are annotated with various relations and form hierarchical graph structures. In our work, we consider only the set of annotated sememes of each sense, without their internal relations. HowNet proposes that annotated sememes can represent senses and words well in real-world scenarios, and sememes are helpful for many NLP tasks [40,41,42]. The OpenHowNet API [43], developed by THUNLP, provides a convenient way to search information in HowNet, display sememe trees, calculate word similarity via sememes, etc. In this paper, we obtain the senses and sememes of words through the OpenHowNet API.

Abstract meaning representation (AMR) [44] is a graph-based semantic representation that captures the “who is doing what to whom” semantics of a sentence. Each sentence is represented as an acyclic graph with labels on nodes (e.g., concepts) and edges (e.g., relations). Every node and edge of the graph is labeled according to the sense of the words in the sentence. Each node in AMR is named by an ID and contains a semantic concept, which can be a word (e.g., man), a PropBank frameset (e.g., want-01), or a special keyword. The keywords include entity types (e.g., date-entity), quantities (e.g., distance-quantity), and logical conjunctions (e.g., and). The edge between two nodes is annotated using more than 100 relations, including frameset argument indices (e.g., “:ARG0”) and semantic relations (e.g., “:location”). Several recent works in natural language processing use AMR [45,46,47]. In this paper, we extend the sentence-level AMR to the document level to better fit the task.

Our method

Research objective

The objective of this paper is to solve three problems of document-level RE. (1) Polysemy, where one word has multiple meanings, strongly affects semantic understanding, and it is also present in document-level relation extraction datasets. (2) Many relations between entity pairs span sentences, which leads to long-distance dependencies. Accurate relation classification therefore often requires strong document-level graph modeling and a comprehensive reasoning approach. (3) The long-tail effect in the datasets degrades model performance. In the following, we describe how we address these issues and present an experimental analysis.

General framework of SKAMRR

In this section, we describe our model (SKAMRR) in detail. As shown in Fig. 2, the approach consists of four parts. (1) Text encoding module: using a text encoder to obtain the initial word embeddings. (2) Sememe knowledge-enhanced abstract meaning representation (AMR) module: constructing document-level AMR graphs with sememe-enhanced word representations and obtaining fully interacting representations of word and entity nodes. (3) Reasoning module: building the entity-pair graph and performing relation reasoning. (4) Classification module: outputting relations with a classification function and applying the proposed global adaptive loss (GAL) to alleviate data imbalance.

Background and notation

We formulate the document-level RE as follows:

Document D: The document D is the raw text that contains multiple sentences. It is represented by a sequence of word tokens, \(\{x_1, x_2,\ldots , x_n\}\), which serves as the input to the word embedding.

Entity E and Mention m: The entity set E consists of the entities that appear in the document D. A mention is an expression in the document that refers to an entity, and each mention is defined as a span of words. Each entity \(e_i\) is represented by the set of its mentions in the document D: \(e_i = \{m_1, m_2,\ldots , m_{N_{e_i}}\}\).

Formally, a document-level relation extraction task can be denoted as \(T = \{X,Y\}\), where X is the instance set and Y is the relation label set. Each instance consists of several tokens \(\{x_1, x_2,\ldots , x_n\}\). The task aims to predict the relation labels between entities, namely \(r_{h,t} = f(h^{h}_e, h^{t}_e)\), where \(h^{h}_e\) and \(h^{t}_e\) are the representations of the head entity and the tail entity in E, and \(r_{h,t}\) is a relation label.

Text encoding module

We use a pre-trained language model (PrLM), such as BERT [48], as the text encoder. BERT has achieved strong results on several natural language processing tasks, demonstrating its powerful modeling ability for text data. The input is a document D of length l, denoted \(D = [x_t]_{t=1}^{l}\), where \(x_t\) is the word at position t. Following previous work, we add the special markers \(</S{\text {-category}}>\) and \(</E{\text {-category}}>\) before and after each entity mention. Then, we obtain the content embedding \({\varvec{H}}\):

$$\begin{aligned} {\varvec{H}} = {\text {PrLM}}([x_1,\ldots , x_n]) = [{\varvec{h}}_1,\ldots , {\varvec{h}}_n], \end{aligned}$$
(1)

where n is the length of the document after adding the special markers. We concatenate the embeddings of the start and end markers of each entity mention as its embedded representation. We use dynamic windows for long texts (\(n>512\)).

Sememe knowledge-enhanced abstract meaning representation module

Sememe-enhanced word representation

We introduce sememe knowledge to enhance the lexical representation of words at a more fine-grained level by fusing different lexical meanings and sememe information. As shown in Fig. 3, a word contains multiple senses, and a sense contains multiple sememes. The sememes of one sense can be organized into a graph. First, we employ HowNet [39, 43] to obtain the senses and sememes of every word. Then, we construct a sememe graph for each word sense. We use the graph attention network (GAT) to pass and aggregate features on the sememe graph as follows:

$$\begin{aligned} {\varvec{h}}_{\textrm{sem}_1},\ldots , {\varvec{h}}_{\textrm{sem}_M} = GAT(v_{\textrm{sem}_1},\ldots , v_{\textrm{sem}_M}), \end{aligned}$$
(2)

where \( \textrm{sem}_1,\ldots , \textrm{sem}_M\) denote all the sememes belonging to one sense sen, \(v_\textrm{sem}\) denotes the word embedding of a sememe, and \({\varvec{h}}_\textrm{sem}\) denotes the corresponding output of the GAT.

We then calculate the representation of each sense by averaging the representations of all its sememes:

$$\begin{aligned} {\varvec{h}}_\textrm{sen} = \frac{\sum _{k=1}^{M} {\varvec{h}}_{\textrm{sem}_k}}{M}. \end{aligned}$$
(3)

Fig. 3 The word “apple” has three senses: apple company, apple fruit, and apple tree. “Apple Company” has three sememes: “PatternValue”, “IspeBrand”, and “Computer”. “Apple Fruit” has “Fruit” and “Apple Tree” has “Reproduce”

Fig. 4 An example of a document-level AMR graph, where S1, S2, and S3 are virtual sentence nodes; See-01, chase-01, and call out-01 are root nodes; and the others are concept nodes

Afterward, all sense vectors are aggregated by global attention [49] to obtain the new word representation:

$$\begin{aligned} a_j&= \frac{\exp (\tanh (w_{s}[{\varvec{h}}_i;{\varvec{h}}_{\textrm{sen}_j}]))}{\sum ^{C}_{c=1}\exp (\tanh (w_{s}[{\varvec{h}}_i;{\varvec{h}}_{\textrm{sen}_c}]))} \end{aligned}$$
(4)
$$\begin{aligned} {\varvec{h}}_i^\textrm{sem}&= \sum _{j=1}^{C} a_j {\varvec{h}}_{\textrm{sen}_j}, \end{aligned}$$
(5)

where \({\textrm{sen}_1,\ldots ,\textrm{sen}_C }\) denote the senses of word i, C is the number of senses of word i, \({\varvec{h}}_i\) is the initial encoder output for word i, and \(w_s\) are trainable parameters. \({\varvec{H}}^\textrm{sem} = {\varvec{h}}_1^\textrm{sem},\ldots , {\varvec{h}}_n^\textrm{sem}\) are the sememe-enhanced representations, which we use as embedding vectors instead of the initial BERT outputs.
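To make the aggregation in Eqs. (3)–(5) concrete, the following minimal sketch averages sememe vectors into a sense vector and then fuses the sense vectors of one word with softmax attention. It assumes plain Python lists as vectors and a single weight vector `w_s` applied to the concatenation of the word vector and a sense vector; the function names `mean_vectors` and `attend_over_senses` are hypothetical and are not taken from the authors' implementation.

```python
import math


def mean_vectors(vectors):
    """Eq. (3): element-wise mean of the sememe vectors of one sense."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def attend_over_senses(h_word, sense_vectors, w_s):
    """Eqs. (4)-(5): attention-weighted sum of the sense vectors of one word.

    h_word        -- encoder output of the word (list of floats)
    sense_vectors -- one vector per sense, e.g. produced by mean_vectors
    w_s           -- weight vector over the concatenation [h_word ; h_sense]
                     (an assumed shape; the text only calls it trainable)
    """
    scores = []
    for h_sense in sense_vectors:
        concatenated = h_word + h_sense  # list concatenation
        scores.append(math.tanh(sum(w * x for w, x in zip(w_s, concatenated))))
    exp_scores = [math.exp(score) for score in scores]
    normalizer = sum(exp_scores)
    weights = [value / normalizer for value in exp_scores]
    dim = len(sense_vectors[0])
    fused = [sum(a * h[d] for a, h in zip(weights, sense_vectors)) for d in range(dim)]
    return weights, fused
```

With an all-zero `w_s`, every sense receives the same weight, so the fused vector reduces to the plain average of the sense vectors; a trained `w_s` shifts the weight toward the senses that fit the context vector.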

Building document-level AMR

AMR is an effective semantic formalism for natural language and can abstract the semantics of a sentence into the words that carry its key information. Some recent works have demonstrated that AMR can assist natural language processing tasks [47]. To obtain adequate and critical information for relation classification, we construct a document-level AMR graph \(G^D=(V^D, E^D)\) for each document. The standard AMR graph is sentence-based, but the task is document-based and requires reasoning about the relationships between entities in different sentences across a document. First, we select all sentences in the document that contain entity mentions and use the AMR parsing model [50] to obtain the corresponding sentence-level AMR graphs. The initial embedding of each node in the graph is the sememe-enhanced representation obtained in the previous step. For each sentence, we construct a virtual sentence node that is connected to the root node of the sentence-level AMR graph, and all virtual sentence nodes are connected to each other. A document-level AMR graph contains three types of nodes (root nodes, virtual sentence nodes, and concept (word) nodes) and four types of edges (root node-virtual sentence node, virtual sentence node-virtual sentence node, concept node-concept node, and concept node-root node). The embedding of a root node is the average embedding of its sentence. The embedding of a concept node is its sememe-enhanced word representation. The embedding of a virtual sentence node is computed from the attention scores of the last layer of BERT as follows:

$$\begin{aligned} a_i^{(h,t)}&= \frac{A_i^{h} \cdot A_i^{t}}{ {\textbf{1}}^\top (A_i^{h} \cdot A_i^{t}) } \end{aligned}$$
(6)
$$\begin{aligned} {\varvec{h}}^\textrm{snode}_{y}&= {\varvec{H}}_{y}^\textrm{sem} a_i^{(h,t)}, \end{aligned}$$
(7)

where \(A_i^{h}\) and \(A_i^{t}\) are the attention matrices of the ith mention of the head and tail entities over the tokens in one sentence, \({\varvec{h}}^\textrm{snode}_{y}\) is the representation of the yth virtual sentence node, and \({\varvec{H}}^\textrm{snode} = {\varvec{h}}_1^\textrm{snode},\ldots , {\varvec{h}}_Y^\textrm{snode}\). In addition, the mentions of the same entity are connected with edges. For the few entity mentions that do not appear in the AMR graph, we construct additional entity nodes and connect them to the virtual sentence nodes of the sentences in which they occur. Figure 4 shows an example of the document-level AMR graph constructed in this paper.
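The computation in Eqs. (6) and (7) can be sketched as follows. The sketch assumes that the product \(A_i^{h} \cdot A_i^{t}\) is element-wise over the tokens of one sentence and that each attention input is already a single vector of per-token weights (for example, averaged over heads); both are our assumptions, and `sentence_node_embedding` is a hypothetical name.

```python
def sentence_node_embedding(attn_head, attn_tail, token_embeddings):
    """Eqs. (6)-(7): pool the token embeddings of a sentence with the
    normalized element-wise product of head and tail attention weights.

    attn_head, attn_tail -- per-token attention weights (lists of floats)
    token_embeddings     -- one embedding (list of floats) per token
    """
    joint = [a * b for a, b in zip(attn_head, attn_tail)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("head and tail attention do not overlap on any token")
    weights = [value / total for value in joint]  # Eq. (6)
    dim = len(token_embeddings[0])
    # Eq. (7): weighted sum of the token embeddings
    return [
        sum(w * token[d] for w, token in zip(weights, token_embeddings))
        for d in range(dim)
    ]
```

Tokens that both the head and the tail mention attend to dominate the pooled vector, while tokens that only one of them attends to contribute nothing.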

Then, we use R-GCN [51] to perform feature extraction on each document-level AMR graph:

$$\begin{aligned} {\varvec{h}}_{u}^{(l+1)}={\text {ReLU}}\left( \sum _{t \in {\mathcal {T}}} \sum _{v \in {\mathcal {N}}_{u}^{t} \cup \{u\}} \frac{1}{c_{u, t}} W_{t}^{(l)} {\varvec{h}}_{v}^{\textrm{sem}(l)}\right) , \end{aligned}$$
(8)

where \({\varvec{h}}_{v}^\textrm{sem} \in \{{\varvec{H}}^\textrm{snode}, {\varvec{H}}^\textrm{sem}\} \), \({\mathcal {T}}\) is the set of edge types, and \(W^{(l)}_t \in {\mathbb {R}}^{d \times d}\) is a trainable parameter. \({\mathcal {N}}_{u}^{t}\) is the set of neighbors of node u under edges of type t, and \(c_{u,t} = |{\mathcal {N}}^t_u |\) is a normalization constant. The final representation of node u is then calculated as follows:

$$\begin{aligned} {\varvec{m}}_u = {\text {ReLU}}(W_u \cdot [{\varvec{h}}_u^{(0)};\ldots ;{\varvec{h}}_u^{(N)}]) + \textrm{Max}({\varvec{h}}_u^{(0)},\ldots , {\varvec{h}}_u^{(N)}), \end{aligned}$$
(9)

where N is the number of R-GCN layers, \(W_u \in {\mathbb {R}}^{d \times (N+1)d}\) is a trainable parameter, and \({\varvec{m}}_u\) is the representation of node u.
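As a rough illustration of the relational message passing in Eq. (8), the sketch below implements one layer over a small graph stored in dictionaries. It is a simplification under stated assumptions: vectors and matrices are plain Python lists, the per-type normalizer is the size of the neighbor set including the self-loop (so that isolated nodes do not cause a division by zero, which differs slightly from \(c_{u,t}\) in the text), and the names `matvec` and `rgcn_layer` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(h, neighbors, weights):
    """One relational graph convolution step in the spirit of Eq. (8).

    h         -- dict: node -> feature vector
    neighbors -- dict: edge type -> dict: node -> list of neighbor nodes
    weights   -- dict: edge type -> d x d weight matrix for that edge type
    """
    updated = {}
    for u, h_u in h.items():
        accumulated = [0.0] * len(h_u)
        for edge_type, matrix in weights.items():
            # neighbors of u under this edge type, plus the self-loop
            group = list(neighbors.get(edge_type, {}).get(u, [])) + [u]
            for v in group:
                message = matvec(matrix, h[v])
                accumulated = [a + m / len(group) for a, m in zip(accumulated, message)]
        updated[u] = [max(0.0, a) for a in accumulated]  # ReLU
    return updated
```

Stacking several such layers and combining their outputs as in Eq. (9) gives each node a view of progressively larger neighborhoods in the document-level AMR graph.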

Reasoning module

In the previous module, our approach lets the useful information in the document-level AMR graph interact fully. In the reasoning module, we consider the multi-hop phenomenon among entities and entity mentions in document-level relation extraction. For example, suppose a document has four entities A, B, C, and D, where (A, C) has relation a, (C, D) has relation b, and (D, B) has relation c; then it can be inferred that (A, B) has relation d. We connect the first-order neighbors of entity mentions to generate inference graphs and then use a GNN to obtain relation representations between multi-hop neighbors. We fuse the multiple mentions of one entity using LogSumExp pooling [23]:

$$\begin{aligned} {\varvec{h}}_{e_i} = \log \sum _{j=1}^{N_{e_i}} \exp ({\varvec{m}}_j), \end{aligned}$$
(10)

where \({\varvec{m}}_j\) denotes the representation of the jth mention of the ith entity and \(N_{e_i}\) is the number of its mentions.
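A minimal sketch of the pooling in Eq. (10) is given below, applied independently to each dimension of the mention vectors. Subtracting the per-dimension maximum before exponentiating is a standard numerical-stability trick that we add here; it does not change the result. The function name `logsumexp_pool` is hypothetical.

```python
import math


def logsumexp_pool(mention_vectors):
    """Eq. (10): LogSumExp pooling over the mention vectors of one entity."""
    pooled = []
    for column in zip(*mention_vectors):  # one column per embedding dimension
        peak = max(column)
        pooled.append(peak + math.log(sum(math.exp(x - peak) for x in column)))
    return pooled
```

LogSumExp behaves like a smooth maximum: an entity with a single mention keeps that mention's vector unchanged, and with several mentions the largest activation in each dimension dominates.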

We construct one entity-pair graph for each document. In particular, each entity pair (head and tail entity) is regarded as a node. If two entity-pair nodes share an entity, they are connected. The feature representation of the entity-pair nodes is computed as follows:

$$\begin{aligned} {\varvec{h}}_{(h, t)}^{h}&=\tanh \left( W_{h} {\varvec{h}}_e^h + b_h \right) \end{aligned}$$
(11)
$$\begin{aligned} {\varvec{h}}_{(h, t)}^{t}&=\tanh \left( W_{t} {\varvec{h}}_e^t + b_t \right) \end{aligned}$$
(12)
$$\begin{aligned} {\varvec{h}}_{(h, t)}^{r}&= {{\varvec{h}}_{(h, t)}^{h}}^{T} W_{p} {\varvec{h}}_{(h, t)}^{t}, \end{aligned}$$
(13)

where \(W_h \in {\mathbb {R}}^{d \times d}\), \(W_t\in {\mathbb {R}}^{d \times d}\), and \(W^i_p \in {\mathbb {R}}^{d^{2}} \) are learnable parameters, and \({\varvec{h}}_{(h, t)}^{r}\) is the vector representation of the entity-pair node.
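The connection rule for the entity-pair graph can be sketched in a few lines. The sketch assumes that every ordered (head, tail) pair of distinct entities is a node and that edges are undirected; the text does not state either choice explicitly, and `build_entity_pair_graph` is a hypothetical name.

```python
from itertools import combinations, permutations


def build_entity_pair_graph(entities):
    """Create entity-pair nodes and connect the nodes that share an entity.

    Returns (nodes, edges), where each node is an ordered (head, tail) tuple
    and each edge is an unordered pair of nodes, listed once.
    """
    nodes = list(permutations(entities, 2))
    edges = [
        (first, second)
        for first, second in combinations(nodes, 2)
        if set(first) & set(second)  # the two pairs have an entity in common
    ]
    return nodes, edges


nodes, edges = build_entity_pair_graph(["A", "B", "C", "D"])
```

For the four-entity example in the text, the nodes (A, C), (C, D), and (D, B) form a path in this graph, so information about relations a, b, and c can reach the node (A, B) after a few rounds of message passing.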

Finally, we use a GNN to encode the entity-pair graph and extract the relation information. Given an entity-pair graph g, the representations produced by the graph encoder are

$$\begin{aligned} {\varvec{h}}_{{(h,t)}_f}^1,\ldots , {\varvec{h}}_{{(h,t)}_f}^S = G({\varvec{h}}_{(h,t)}^1,\ldots , {\varvec{h}}_{(h,t)}^S), \end{aligned}$$
(14)

where G() is the graph encoder, for which we use a graph isomorphism network [13] because of its strong representation ability. \({\varvec{h}}_{(h,t)}^1,\ldots , {\varvec{h}}_{(h,t)}^S\) denote the initial representations of the S entity-pair nodes, which are calculated by Eq. (13).
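For readers unfamiliar with GIN, one layer updates each node by applying a small network to the sum of its own (slightly re-weighted) features and its neighbors' features. The sketch below shows that update under simplifying assumptions: features are Python lists, the graph is a dictionary of adjacency lists, and the network is passed in as a callable `mlp`. The name `gin_layer` and the default `eps=0.0` are illustrative choices, not the authors' configuration.

```python
def gin_layer(h, adjacency, mlp, eps=0.0):
    """One GIN update: h_v' = mlp((1 + eps) * h_v + sum of neighbor features).

    h         -- dict: node -> feature vector
    adjacency -- dict: node -> list of neighboring nodes
    mlp       -- callable mapping a feature vector to a feature vector
    """
    updated = {}
    for v, h_v in h.items():
        aggregated = [(1.0 + eps) * x for x in h_v]
        for u in adjacency.get(v, []):
            aggregated = [a + x for a, x in zip(aggregated, h[u])]
        updated[v] = mlp(aggregated)
    return updated
```

Applying the layer k times lets an entity-pair node receive information from nodes up to k hops away, which is what the multi-hop reasoning over entity pairs relies on.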

Classification module

We first concatenate the entity-pair representation and the two entity representations to generate the final representation for relation classification

$$\begin{aligned} {\varvec{r}}_{(h, t)}=\left[ {\varvec{h}}_{e}^{h}; {\varvec{h}}_{e}^{t}; {\varvec{h}}_{{(h, t)}_f}\right] , \end{aligned}$$
(15)

where \({\varvec{h}}_{e}^{h}\) and \({\varvec{h}}_{e}^{t}\) are computed by Eq. (10), and \({\varvec{h}}_{{(h, t)}_f}\) is obtained from Eq. (14).

Then, we adopt a linear layer to predict relations

$$\begin{aligned} l_{f}^{(h, t)}={\textbf{W}}_{f} {\varvec{r}}_{(h, t)} + b_{f}, \end{aligned}$$
(16)

where \(l_{f} \in {\mathbb {R}}^{c}\) denotes the output logits for all relations, \({\textbf{W}}_{f} \in {\mathbb {R}}^{d \times c}\) is the weight matrix that maps the relation embedding to each class, and c is the number of label categories.

Document-level relation extraction is essentially a multi-label classification problem, and [23] proposes adaptive thresholding loss (ATL) to solve the multi-label problem. ATL is designed with a special category TH as the adaptive threshold, with positive cases above TH and negative cases below or equal to TH. The original version of the loss function is formulated as follows:

$$\begin{aligned}&P\left( r_{i} \mid e_{h}, e_{t}\right) =\frac{\exp \left( l_{f, r_{i}}^{(h, t)} \right) }{\sum _{r^{\prime } \in {\mathcal {P}}_{T} \cup \{TH\}} \exp \left( l_{f, r^{\prime }}^{(h, t)} \right) } \end{aligned}$$
(17)
$$\begin{aligned}&{\mathcal {L}}_{1} = -\sum _{r_{i} \in {\mathcal {P}}_{T}} \log P\left( r_{i} \mid e_{h}, e_{t}\right) \end{aligned}$$
(18)
$$\begin{aligned}&{\mathcal {L}}_{2}=-\log \left( \frac{\exp \left( l_{f, TH}^{(h, t)} \right) }{\sum _{r^{\prime } \in {\mathcal {N}}_{T} \cup \{TH\}} \exp \left( l_{f, r^{\prime }}^{(h, t)}\right) }\right) \end{aligned}$$
(19)
$$\begin{aligned}&{\mathcal {L}}= {\mathcal {L}}_{1} + {\mathcal {L}}_{2}, \end{aligned}$$
(20)

where \(l_{f, r}^{(h, t)}\) is the component of the logit vector \(l_{f}^{(h, t)}\) for relation r, and the positive classes \({\mathcal {P}}_{T} \subseteq R \) are the relations that exist between the entities in T. If T does not express any relation, \({\mathcal {P}}_{T}\) is empty. The negative classes \({\mathcal {N}}_{T} \subseteq R \) are the relations that do not exist between the entities. If T does not express any relation, \({\mathcal {N}}_{T} = R \).
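The two terms of the adaptive thresholding loss can be sketched for a single entity pair as follows. The sketch assumes that the logits are a plain list in which index `th` (0 by default) holds the threshold class TH and the remaining indices are relations; the function names `log_softmax_at` and `atl_loss` are hypothetical.

```python
import math


def log_softmax_at(logits, index, candidates):
    """Log-probability of `index` under a softmax restricted to `candidates`."""
    peak = max(logits[c] for c in candidates)
    log_norm = peak + math.log(sum(math.exp(logits[c] - peak) for c in candidates))
    return logits[index] - log_norm


def atl_loss(logits, positive, th=0):
    """Return (L1, L2) of Eqs. (18)-(19) for one entity pair.

    logits   -- list of floats; position `th` is the threshold class
    positive -- indices of the relations that hold for this pair
    """
    relations = [r for r in range(len(logits)) if r != th]
    negative = [r for r in relations if r not in positive]
    # L1: each positive relation competes only with the other positives and TH
    loss_pos = -sum(
        log_softmax_at(logits, r, list(positive) + [th]) for r in positive
    )
    # L2: TH competes only with the negative relations
    loss_neg = -log_softmax_at(logits, th, negative + [th])
    return loss_pos, loss_neg
```

The first term pushes every positive relation above the threshold logit, and the second pushes the threshold above every negative relation, so at prediction time the relations whose logits exceed the TH logit are returned.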

We use the idea of the gradient harmonizing mechanism (GHM) [52] to balance the positive examples and propose the global adaptive loss (GAL) to enhance the effect of ATL. The design intuition of our loss function is to keep the model from focusing too much on easy-to-classify samples and on extremely hard-to-classify samples (outliers). Gradient density is introduced to measure the number of samples that fall within a specific gradient range, so that the updates contributed by samples at each gradient level become more balanced

$$\begin{aligned}&{\mathcal {L}}^{'}= \alpha _{1} \cdot {\mathcal {L}}_{1} + {\mathcal {L}}_{2}\end{aligned}$$
(21)
$$\begin{aligned}&\quad \alpha _{1} = \frac{N}{GD{(g_r)}}, \end{aligned}$$
(22)

where \({\mathcal {L}}^{'}\) is our loss, GAL, and \(GD(g)=\frac{1}{l_{\varepsilon }(g)} \sum _{k=1}^{N} \delta _{\varepsilon }\left( g_{k}, g\right) \) denotes the gradient density. Following GHM [52], \(\delta _{\varepsilon }\left( g_{k}, g\right) \) indicates whether the gradient norm \(g_k\) of the kth of the N samples in a batch falls within the interval of width \(\varepsilon \) centered at g, and \(l_{\varepsilon }(g)\) is the valid length of that interval. The coefficient \(\alpha _{1}\) transfers the gradient adjustment to the loss function, so the loss can be optimized by adjusting the value of \(\alpha _{1}\).
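The weighting in Eq. (22) can be illustrated with the gradient density definition from GHM [52]. The sketch below assumes that gradient norms lie in [0, 1], as they do for the sigmoid cross-entropy case discussed in GHM; how \(g_r\) is obtained for the ATL terms is not specified in the text, so the inputs here are hypothetical numbers, as are the names `gradient_density` and `gal_weight`.

```python
def gradient_density(g, grad_norms, eps):
    """GD(g): number of samples whose gradient norm lies in the window of
    width eps centered at g, divided by the valid length of that window."""
    low, high = g - eps / 2.0, g + eps / 2.0
    count = sum(1 for g_k in grad_norms if low <= g_k < high)
    valid_length = min(high, 1.0) - max(low, 0.0)
    if count == 0 or valid_length <= 0.0:
        raise ValueError("g must be a gradient norm from the batch, inside [0, 1]")
    return count / valid_length


def gal_weight(g, grad_norms, eps=0.1):
    """Eq. (22): alpha_1 = N / GD(g) for a batch of N gradient norms."""
    return len(grad_norms) / gradient_density(g, grad_norms, eps)
```

Samples that sit in a crowded gradient region share a large density and receive a small weight, while samples in sparse regions are weighted up, which is the balancing effect described above.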

Experiments

The goal of our experiments is to show that (1) our model can capture important sentence-level features as well as document-level features of relevant entity pairs and combine these features for inference to obtain document-level relation extraction results, and (2) our proposed loss function can mitigate the impact of the imbalanced sample distribution on the performance of the model. In this section, we first introduce four document-level relation extraction datasets. We then give some model parameters in this paper as well as the baseline model used for experimental comparison. We conclude by evaluating the model and using ablation experiments to illustrate the robustness and efficiency of the model architecture in this paper.

Dataset statistics

DocRED is a large-scale manually annotated document-level RE dataset constructed from Wikipedia and Wikidata with two features. (1) It contains 132,375 entities and 56,354 relational facts annotated on 5,053 Wikipedia documents. (2) Since at least 40.7% of the relations in DocRED can only be extracted from multiple sentences, models must read multiple sentences in a document to identify entities and reason about their relations. The dataset contains 3053, 1000, and 1000 instances as the training set, validation set, and test set, respectively. Reference [53] creates the Chemical-Disease Reactions dataset (CDR). It contains one kind of relation, Chemical-Induced Disease, between chemical and disease entities. The dataset contains 500 documents for training, 500 for development, and 500 for testing. The Gene-Disease-Associations dataset (GDA) is created by [54]. It has one kind of relation, “Gene-Induced-Disease”, between gene and disease entities. We follow the standard split of 23,353 documents for training, 5839 for development, and 1000 for testing. HacRED is a large-scale dataset with a reasonable data distribution that focuses on the hard cases of relation extraction. We select 6231 samples as the training set, 1500 as the validation set, and 1500 as the test set (Table 1).

Experiment settings and evaluation metrics

We use the PyTorch [55] and DGL [56] frameworks to implement the model. For the DocRED dataset, we use BERT-large [48] and RoBERTa-large [57] as the initial document encoders and Xu’s model [50] as the AMR parser. For the CDR dataset, we use BioBERT-Base v1.1 as the encoder and a transformer-based AMR parser [58] pre-trained on the Biomedical AMR corpus. The optimizer is AdamW [59]. We set the initial learning rate to 2e−5 for all encoder modules and to 1e−4 for the other modules. We set both the embedding dimension and the hidden dimension to 768. The GNN encoders have three layers, and the hidden size of the node embeddings is 768. All experiments are run on an NVIDIA RTX 3090 GPU. Following previous work [6, 8], we use micro F1 and micro Ign F1 as the evaluation metrics. Ign F1 is the F1 score computed after excluding the relational facts that appear in both the training set and the development/test set.
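The two metrics can be sketched over sets of relational facts as follows. This is a simplified reading of the description above, not the official DocRED evaluation script, which may handle the facts seen in training differently; the tuple layout of a fact and the names `micro_f1` and `ign_f1` are assumptions made for illustration.

```python
def micro_f1(predicted, gold):
    """Micro F1 over sets of facts, e.g. (document, head, tail, relation)."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2.0 * precision * recall / (precision + recall)


def ign_f1(predicted, gold, seen_in_train):
    """Simplified Ign F1: drop the facts already present in the training set
    from both the predictions and the gold set, then compute micro F1."""
    return micro_f1(predicted - seen_in_train, gold - seen_in_train)
```

A model that mostly reproduces facts memorized from the training set scores lower on Ign F1 than on F1, which is why both numbers are reported.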

Table 1 Dataset details

Compared methods

We compare against multiple models, which can be classified into graph-based and non-graph-based approaches. We label BERT-base as Bb, BERT-large as Bl, and RoBERTa-large as Rol.

Graph-based methods:

LSR [10] is an end-to-end document-level relation extraction approach that treats the graph structure as a latent variable and refines the graph at each iteration step.

GEDA [17] considers the attention between sentences and potential relation instances as a many-to-many relationship and therefore introduces a bi-attention mechanism, including the attention of sentence-to-relation and relation-to-sentence.

GCGCN-BERT [60] proposes a novel graph convolutional network with two hierarchical blocks: context-aware attention guided graph convolution for partially connected graphs and multi-head attention guided graph convolution for fully connected graphs.

GLRE [16] models entity pairs by encoding document information into global and local representations as well as contextual relation representations.

HeterGSAN [19] manages to reconstruct path dependencies from graph representations to ensure that the proposed DocRE model is more concerned with encoding pairs of entities with relations in training.

SIRE [21] represents intra- and inter-sentential relations differently and designs a straightforward form of logical reasoning that can cover more logical reasoning chains.

DRN [9] designs a discriminative inference network for estimating the relation probability distributions of different inference paths and then models the inference method for the relation between each entity pair in the document.

CGM2IR [61] proposes context-guided coreferential mentions integration in a weighted sum manner and inter-pair reasoning.

Non-graph-based methods:

BERT [62] solves the DRE task in two phases to improve performance: the first phase predicts whether two entities are related, and the second predicts the specific relation.

HINBERT [63] proposes a hierarchical inference network (HIN) for document-level inference, which can aggregate inference information from entity level to sentence level and then document level.

CorefBERT [22] adds a mention reference prediction (MRP) pre-training task to fuse co-reference information into the pre-trained model.

SSAN [11] argues that structural dependencies should be incorporated within the encoding network and throughout the system, leading to the proposal of structured self-attention network, which can effectively model these dependencies within its construction blocks and in all network layers from the bottom up.

ATLOP [23] proposes a localized context pooling structure and adaptive thresholding to solve the multi-label and multi-entity problems.

MRN [20] offers a mention-based reasoning network to distinguish the impacts of close and distant entity mentions in relation extraction and consider the interactions between local and global contexts.

DocuNet [12] treats DRE as analogous to the semantic segmentation task in computer vision and uses a U-shaped module to capture the global interdependencies between the triples on the image-style feature graph.

On the CDR and GDA datasets, we compare our SKAMRR model with six baselines, including EoG [7], DHG [18], LSR, MRN, ATLOP, and CGM2IR.
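Several of the baselines above must return zero, one, or many relations for each entity pair. As an illustration of the adaptive thresholding idea used by ATLOP, the sketch below decodes the relations of one entity pair by comparing each class logit against the logit of a learned threshold class. Reserving class index 0 for the threshold class and the function name are assumptions of this sketch, not details taken from the compared systems.

```python
from typing import List, Sequence


def decode_with_threshold_class(logits: Sequence[float], th_index: int = 0) -> List[int]:
    """Return the relation classes whose logit exceeds the threshold-class logit.

    An empty list means that no relation is predicted for the entity pair.
    """
    threshold = logits[th_index]
    return [
        index
        for index, score in enumerate(logits)
        if index != th_index and score > threshold
    ]


# Placeholder logits for one entity pair; index 0 is the threshold class.
multi_label = decode_with_threshold_class([0.5, 1.2, -0.3, 0.9])
no_relation = decode_with_threshold_class([2.0, 1.0, 0.1])
```

Because the threshold is itself a model output, it can differ from one entity pair to the next, which avoids tuning a single global cut-off on the development set.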

Main results

Table 2 Results on the development and test set of DocRED
Table 3 Experimental results on HacRED

Results on DocRED: The results of our experiments are presented in Tables 2 and 3 and Figs. 5 and 6. On DocRED, SKAMRR outperforms the baseline models on both the development and test sets. Among the document-graph-based approaches, SKAMRR outperforms the best-performing model, SIRE, on both F1 and Ign F1 when BERT-base is the document encoder, and it outperforms GAIN when RoBERTa-large is the document encoder. The best-performing existing models in document-level relation extraction are sequence-model-based approaches, represented by DocuNet and ATLOP. SKAMRR outperforms both of them on F1 and Ign F1, with BERT-base and RoBERTa-large as the document encoders, respectively. With BERT-base as the encoder, SKAMRR outperforms DocuNet by 2.3% and 2.34% on test F1 and Ign F1, respectively. With RoBERTa-large as the encoder, SKAMRR outperforms ATLOP by 4.38% and 4.65% on test F1 and Ign F1, respectively. The gap between Ign F1 and F1 is also smaller for our model under both document encoders, which suggests that SKAMRR generalizes well. In addition, RoBERTa-large performs better than BERT-base as the encoder, which shows the benefit of stronger pre-trained models and suggests that the task will continue to gain from future advances in pre-trained models.

Fig. 5 Results on the test set of CDR. The results of baselines are from their original papers

Fig. 6 Results on the test set of GDA. The results of baselines are from their original papers

Results on CDR and GDA: The F1 scores on the CDR dataset are shown in Fig. 5. Among the compared methods, the graph-based methods (CGM2IR and ours) perform better at extracting relations. This suggests that the graph structure better preserves the interactions between different elements in the document, which helps the model correctly classify cross-sentence relations. Furthermore, our method achieves the best performance on all the metrics, which demonstrates the effectiveness of SKAMRR. In particular, it indicates that our method can not only automatically learn multi-hop paths for inter-sentence relations but also identify the semantic path within a sentence for intra-sentence extraction. As shown in Fig. 6, SKAMRR achieves 84.2 on GDA, which is better than nearly all of the compared methods. However, it is slightly lower than CGM2IR (\(-\)0.5%), which is primarily due to the smaller share of inter-sentence relations in GDA (only 13%, compared with 30% in CDR), which leaves SKAMRR with fewer cross-sentence examples to learn from. Overall, the method is also effective for document-level relation extraction in the biomedical domain.

Results on HacRED: The experimental results on the HacRED dataset are shown in Table 3. We select baseline methods including ATLOP and GAIN, which represent the best-performing graph-free and graph-based methods, respectively. For this experiment, we rely on the open-source code released with the original papers. Our method outperforms the ATLOP baseline on all metrics, by 1.2% in accuracy, 0.62% in recall, and 0.77% in F1. Moreover, every compared method performs clearly better on HacRED than on DocRED. This is unexpected: because HacRED focuses on hard relations while DocRED is more general, models would normally be expected to perform worse on HacRED. We attribute this phenomenon to two reasons. (1) HacRED has significantly more annotated samples than DocRED, so the models are trained more fully and generalize better. (2) HacRED has only 26 relation categories and its data distribution is more balanced, which significantly reduces the number of classes with few samples and also lets the models train more fully.

Furthermore, we use four models, namely LSR, GAIN, ATLOP, and SKAMRR, to create the critical distance diagram. As Fig. 7 shows, SKAMRR outperforms the other models. Our model achieves competitive results on all four datasets. The experimental results show that it exploits feature information well both within and across sentences and accurately infers the relation classes between entities through the proposed inference method.
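The text does not state how the diagram is computed. The sketch below assumes the common procedure for such diagrams: rank the methods on each dataset, average the ranks, and compare rank differences against the Nemenyi critical difference \(CD = q_{\alpha}\sqrt{k(k+1)/(6N)}\) for \(k\) methods and \(N\) datasets. The method names and scores are placeholders, not results from this paper, and the function names are illustrative.

```python
import math
from typing import Dict, List

# Nemenyi critical values q_alpha for alpha = 0.05, indexed by the number of
# compared methods (values as tabulated by Demsar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Rank methods on each dataset (1 = best, ties share the mean rank), then average."""
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {method: 0.0 for method in methods}
    for d in range(n_datasets):
        column = [scores[method][d] for method in methods]
        for method in methods:
            value = scores[method][d]
            higher = sum(other > value for other in column)
            equal = sum(other == value for other in column)  # includes the method itself
            totals[method] += higher + (equal + 1) / 2
    return {method: totals[method] / n_datasets for method in methods}


def critical_difference(n_methods: int, n_datasets: int) -> float:
    """Nemenyi critical difference at alpha = 0.05."""
    q_alpha = Q_ALPHA_005[n_methods]
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6 * n_datasets))


# Placeholder scores: one list entry per dataset.
placeholder_scores = {
    "method_a": [0.60, 0.70],
    "method_b": [0.62, 0.68],
    "method_c": [0.50, 0.50],
}
ranks = average_ranks(placeholder_scores)
cd = critical_difference(n_methods=4, n_datasets=4)
```

Under this procedure, two methods differ significantly only if their average ranks differ by more than the critical difference, so a comparison over few datasets yields a wide critical difference.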

Fig. 7 Critical distance diagram for LSR, GAIN, ATLOP, and SKAMRR

Ablation study

Table 4 Ablation study of SKAMRR on the Dev set of DocRED, where “w/o” indicates without
Table 5 Results for long-tail type relations, which contain ten relations, where “w” indicates with

We design ablation experiments to examine the structure of our method and the contribution of each part. The ablation experiments are divided into two parts. In the first part, we examine the model’s structure, involving the AMR module, the sememe knowledge, and the loss function (GAL). Table 4 shows the results. First, we use the GAIN scheme instead of AMR to construct the document graph, which leads to a drop of 1.37% in Ign F1 and 1.4% in F1. Then, we remove the sememe information from the model, which leads to a drop of 1.11% in Ign F1 and 1.12% in F1. Next, we replace the loss function with the conventional adaptive thresholding loss [23], and F1 and Ign F1 decrease by 1.14% and 1.18%, respectively. Finally, we remove all three components, and the metrics drop even further, by 1.93% in Ign F1 and 2.07% in F1. These results show that the AMR graph structure plays a key role, since performance drops the most among the single-component ablations when it is replaced. They also show the effectiveness of the sememe knowledge fusion and the GAL loss function, and that the overall results depend on each part of the method.

In the second part, we construct a long-tail subset of the dataset, containing 86 relations, to verify the validity of the GAL proposed in this paper. The results are shown in Table 5. The F1 of our model decreases by 0.91% and 0.65% when we use the standard CE and ATL losses instead, respectively. This shows that GAL improves the model’s performance on class-imbalanced data and mitigates the impact of the imbalanced sample distribution.

Parameter sensitivity

Since we use GNNs in sememe information fusion, AMR graph modeling, and entity-pair graph inference, we experimentally examine the key GNN parameter, the number of interaction steps. The performance of SKAMRR is influenced by this number, so we compare several values to analyze which one yields the best performance. In particular, we compare the values \(\{0, 1, 2, 3, 4, 5\}\). Fig. 8 shows that three interaction steps achieve the best performance among all the compared values. The result with two steps is also satisfactory, indicating that two interaction steps are a reasonable choice when computational resources are limited. This result supports our choice of three interaction steps.
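To illustrate what the number of interaction steps controls, the following sketch runs a fixed number of message-passing rounds on a small graph. It uses plain mean aggregation over a node and its neighbors, which is a stand-in chosen for this sketch and not the GNN layers of SKAMRR. The function name and the toy graph are hypothetical.

```python
from typing import Dict, List


def propagate(
    features: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
    steps: int,
) -> Dict[str, List[float]]:
    """Run `steps` rounds of mean aggregation; steps=0 returns a copy of the input."""
    current = {node: list(vector) for node, vector in features.items()}
    for _ in range(steps):
        updated = {}
        for node, vector in current.items():
            # Average the node's own vector with those of its direct neighbors.
            group = [vector] + [current[n] for n in neighbors.get(node, [])]
            updated[node] = [sum(values) / len(group) for values in zip(*group)]
        current = updated
    return current


# Toy path graph a - b - c, where only node "a" carries a signal at the start.
toy_features = {"a": [1.0], "b": [0.0], "c": [0.0]}
toy_neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

after_one_step = propagate(toy_features, toy_neighbors, steps=1)
after_two_steps = propagate(toy_features, toy_neighbors, steps=2)
```

In the toy graph, node "c" receives information from node "a" only after two steps, since each step reaches one hop further. With this untrained averaging, many steps also make node representations more similar to one another, which is one possible reason why a moderate number of steps can work best.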

Fig. 8 Results of different number of interactions on the test set of DocRED

Conclusion

In this paper, we propose SKAMRR, a document-level relation extraction method. Instead of simply using entities or words to build graphs, we employ AMR as the basis for building document graphs, use sememe-enhanced word representations, and exchange helpful information through the document-level AMR graphs. Afterward, we obtain the entities’ features and build the entity-pair graph for relation reasoning. Finally, we design a global threshold adaptation loss function that alleviates the problem of unbalanced category samples in the dataset. Experimental results show that SKAMRR achieves very competitive performance on four real-world datasets, which verifies its effectiveness. For future work, we plan to (1) use graph contrastive learning to improve the performance of document-level relation extraction based on AMR graphs; (2) design a unified framework that unifies the graph interaction process at different stages, so that the interaction goals are met at a reasonable computational cost; and (3) continue to explore new loss functions to better address the uneven data distribution in the datasets (long-tail data).