1 Introduction

The extraction of semantic relationships between chemical and disease entities has wide-ranging applications in biomedical research and healthcare, including toxicology studies, drug discovery, and drug safety surveillance [1]. To encourage active involvement from the natural language processing (NLP) community, the BioCreative V community [1] launched the chemical-induced disease (CID) relation extraction challenge.

CID relation extraction is usually formulated as a binary classification problem over recognized biomedical named entities. Given a document of multiple sentences, a list of named entity mentions, and pairs of chemical and disease concept identifiers, the objective is to determine whether a chemical–disease identifier pair expresses a CID relation. Figure 1 illustrates an example of CID relation extraction.

The CID extraction task, however, presents several challenges. First, entity relations are annotated at the concept level rather than the mention level, and an entity can be mentioned multiple times in different sentences of a document. As a result, interactions between entities often transcend sentence boundaries [2]; these are referred to as inter-sentence relations. Recognizing inter-sentence relations is considerably more intricate than identifying intra-sentence relations: to tackle the task successfully, a model requires a comprehensive understanding of cross-sentence dependencies and discourse information [3, 4].

Second, as entities can be mentioned across sentences, another challenge is to identify long-distance CID relations, in which a chemical–disease mention pair may be separated by hundreds of tokens. Conventional RNN-based models, including long short-term memory (LSTM) [5] and gated recurrent unit (GRU) [6] networks, struggle to capture information from very long word sequences [7]. To address this issue, earlier works have proposed alternative approaches that employ either convolutional neural networks [8] or graph-based neural models [3] with advanced contextual word embeddings on document-level dependency graphs.

Fig. 1

An example of CID relation extraction

In this work, we propose a graph LSTM-based neural model called REGREx for extracting CID relations at the document level. Our model leverages global dependency information across multiple sentences to improve performance. It facilitates information exchange among nodes through dependency connections, thus enhancing long-distance relation extraction. To control the information flow, we extend the LSTM network's gating mechanism to the graph. We also enhance the input representation by incorporating advanced biomedical contextual word embeddings. Finally, we jointly train relation extraction with a named entity recognition task to improve model robustness. Experimental results demonstrate superior performance compared with state-of-the-art methods for CID relation extraction.

2 Related Work

Deep learning models have emerged as dominant methods for various NLP tasks. Convolutional neural networks (CNN) [9, 10] and long short-term memory (LSTM) networks [5] have been utilized for CID relation extraction. [11] achieved good performance by employing both LSTM and CNN models on traditional word embeddings [12] together with additional linguistic features. [13] introduced a CNN model that extracts syntactic features from local shortest dependency paths (SDPs) to predict intra-sentence CID relations. [8] enhanced CNN with character-based word embeddings for CID relation extraction. The multi-head self-attention mechanism proposed by Vaswani et al. [14], which enables models to simultaneously capture important and complex long-range dependencies from diverse contextual perspectives of the input sequence, has significantly advanced downstream NLP tasks, including CID extraction. Inspired by this concept, Sahu et al. [2] introduced a bi-affine multi-head self-attention model specifically for CID extraction, yielding performance on par with other well-established CNN- and LSTM-based methods.

Recent advanced approaches in the field have endeavored to overcome the inherent limitations of intra-sentence linguistic features. They often involve constructing various unified graphs as representations of each document, upon which a diverse range of graph-based neural approaches have been proposed. Notably, Sahu et al. [3] introduced a graph convolutional network (GCN) model [15] on a document-level graph whose edges encode both dependencies and co-references. This work demonstrated great promise in effectively extracting inter-sentence semantic relations. In another study, Wang et al. [16] combined a GCN with a multi-head self-attention mechanism on the document-level dependency graph. Additionally, some studies integrate graphs at other levels, such as GRACR by Liu et al. [17] with an entity-level graph and MHGNN by Wang et al. [18] with three graphs: a word-level graph, a mention-level graph, and an entity-level graph.

Lu et al. [19] constructed a hybrid graph that merges syntactic and abstract meaning representation graphs, along with hierarchical concentrative attention, effectively capturing and prioritizing important long-distance information for CID relation prediction. Li et al. [20] presented MRN, a mention-based reasoning network that incorporates both local and global reasoning along with a co-predictor module to predict CID relations. Nan et al. [21] introduced LSR, a latent structure refinement model that dynamically learns the document graph. LSR performs end-to-end predictions without relying on syntactic trees or heuristics, enhancing the extraction of CID relations. Shi et al. [22] proposed HGNN, a method leveraging heterogeneous graph neural networks. HGNN utilizes temporal convolutional networks and graph transformer networks to capture long-distance dependencies and enhance potential interactions between entities. Zhang et al. [23] presented DHG, a model based on a dual-tier heterogeneous graph. DHG incorporates a structure modeling layer and a relation reasoning layer for multi-hop reasoning and decision-making. Xu et al. [24] introduced SSAN, a structural self-attention network model. SSAN integrates unique dependencies between mention pairs using self-attention, enhancing the overall encoding process.

Very recently, an innovative approach to document-level semantic relation extraction has been proposed that transforms the extraction task into a question answering task. Chen et al. [25] introduced RC, a model that follows this approach: after the transformation, RC leverages reading comprehension and prior knowledge to improve the document-level extraction process.

3 Method

In this section, we describe the details of our proposed model, which consists of six modules. First, we describe how the document-level dependency graph is constructed to represent the entire input document. Second, we describe how each word in the original document is converted into a vector, which serves as the input representation for the document-level dependency graph. The third sub-section focuses on the traditional LSTM network architecture, which gathers and enriches contextual information for each word in the paragraph. The fourth sub-section explains how the graph's state transition occurs. The fifth sub-section details how the model makes entity-level predictions for CID relations. The following sub-section outlines the named entity recognition task. Finally, we provide the details of how the model is trained.

3.1 Document-Level Dependency Graph

The document-level dependency graph is the core component of our proposed method. To construct this unified graph, we generate a dependency tree for every sentence in the input document. Within a dependency tree, each node is connected to its parent or descendants through syntactic dependencies (e.g., nsubj, case, and det). To create the cross-sentence dependency graph for the document, we connect the roots of the dependency trees of two consecutive sentences; this new edge type is referred to as next-sent. Additionally, we enrich the document graph by introducing another connection, called next-node, which links two adjacent nodes. Finally, we allow each node to connect to itself via a special edge that we denote as self-edge. Figure 2 depicts an example of our document-level dependency graph.

Formally, let us consider our document dependency graph \(\textbf{G} = ( \textbf{V}, \textbf{E} )\), where \(\textbf{V}\) and \(\textbf{E}\) represent the sets of all nodes and edges, respectively. \(\textbf{V}\) contains all word tokens in the document, whereas \(\textbf{E}\) comprises edges belonging to one of the four types below (a construction sketch follows the list):

  • Syntactic dependency edge: represents the syntactic relation between two nodes in a dependency tree.

  • Next-sent edge: connects the roots of two consecutive sentences to capture cross-sentence dependencies.

  • Next-node edge: connects adjacent nodes within the same sentence to capture local dependencies.

  • Self-edge: allows a node to link to itself.
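
To make the construction concrete, the following Python sketch assembles the four edge types from pre-parsed sentences. The input format, (token index, head index, dependency label) triples with document-wide token indices, is our own assumption for illustration; the paper obtains parses with ScispaCy (Sect. 4.2).

```python
# A minimal sketch of the document-level graph construction, assuming each
# sentence is pre-parsed into (token_idx, head_idx, dep_label) triples with
# document-wide token indices and head_idx == -1 marking the sentence root.

def build_document_graph(sentences):
    edges, prev_root = [], None
    for sent in sentences:
        root = None
        for tok_idx, head_idx, dep_label in sent:
            if head_idx == -1:
                root = tok_idx
            else:
                edges.append((head_idx, tok_idx, dep_label))  # syntactic edge
            edges.append((tok_idx, tok_idx, "self-edge"))     # self-edge
        # next-node edges between adjacent tokens of the same sentence
        idxs = [t for t, _, _ in sent]
        edges.extend((a, b, "next-node") for a, b in zip(idxs, idxs[1:]))
        # next-sent edge between the roots of consecutive sentences
        if prev_root is not None and root is not None:
            edges.append((prev_root, root, "next-sent"))
        prev_root = root
    return edges
```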

Fig. 2

An example of our dependency graph. For simplicity, we only consider two consecutive sentences and omit all next-node edges and self-edges

3.2 Input Representation

In this section, we present how we construct the input word vectors for our document-level graph LSTM. Let \(x_{i} \in \mathbb {R}^{d}\) denote the embedding representation of the ith token in the input sequence \(w_1, w_2,\ldots , w_n\). We build \(x_{i}\) by combining four types of embeddings: a contextual word embedding \(e_{w_i} \in \mathbb {R}^{d_1}\), a character embedding \(c_{w_i} \in \mathbb {R}^{d_2}\), a part-of-speech (POS) embedding \(p_{w_i} \in \mathbb {R}^{d_3}\), and a distance embedding \(d_{w_i} \in \mathbb {R}^{d_4}\). As a result, \(d = d_1 + d_2 + d_3 + d_4\), and

$$\begin{aligned} x_{i} = e_{w_i} \circ c_{w_i} \circ p_{w_i} \circ d_{w_i}. \end{aligned}$$
(1)

Here, \(\circ \) denotes the concatenation operation.

Contextual Word Embedding: To generate the contextual representation of each token in the input document, we utilize a biomedical version of ELMo [26] that has been pre-trained on 10 million PubMed abstracts comprising a total of 2.46B tokens. The word vector \(e_{w_i}\) is a dense representation in a \(d_1\)-dimensional space. In recent years, contextual word embeddings, such as ELMo [27], Flair [28, 29], BERT [30], and auto-regressive language models [31], have exhibited significant performance improvements on various natural language processing (NLP) tasks, including text classification [32], question answering [30], and named entity recognition [28]. Integrating these powerful biomedical context-sensitive word embeddings with deep neural networks has the potential to enhance the performance of the CID extraction task.

Fig. 3

An illustration of the convolutional neural network (CNN)-based character embeddings [33]

Character Embedding: Previous studies have demonstrated that character-based word embeddings enable models to capture unknown words and word morphology features [33, 34]. In the biomedical domain, we often encounter complex terminologies, such as chemical, protein, or gene names, that exhibit rich morphological structures. Following the approach proposed by [8], we apply a simple CNN layer with \(d_2\) filters to a sequence of character embeddings, each of dimension \(d_5\). To obtain the character-based representation \(c_{w_i}\), we apply a max-pooling layer after this CNN layer to capture the most salient features. Figure 3 depicts our CNN-based character embeddings.
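
As a concrete illustration, the sketch below implements such a character encoder in PyTorch. The filter count follows the character embedding dimension \(d_2 = 30\) reported in Sect. 4.2; the per-character embedding size \(d_5\) and the convolution window are our assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Sketch of the CNN-based character encoder (Fig. 3)."""
    def __init__(self, n_chars, d5=25, d2=30, kernel_size=3):  # d5, window assumed
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d5, padding_idx=0)
        self.conv = nn.Conv1d(d5, d2, kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids):           # char_ids: (n_words, max_word_len)
        e = self.char_emb(char_ids)        # (n_words, max_word_len, d5)
        e = self.conv(e.transpose(1, 2))   # (n_words, d2, max_word_len)
        return e.max(dim=2).values         # max-pool over positions -> (n_words, d2)
```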

POS Embedding: In addition to the word embedding and the character embedding, we also embed the Part-of-Speech (POS) information into the input representation. The POS embedding \(p_{w_i}\) is randomly initialized as a \(d_3\)-dimensional vector.

Distance Embedding: We enrich the input representation by incorporating the absolute distances (in terms of tokens) from the current token \(w_i\) to the two target entities. The distance embedding \(d_{w_i}\) consists of two sub-vectors, \(d^C_{w_i} \in \mathbb {R}^{d^C_4}\) and \(d^D_{w_i} \in \mathbb {R}^{d^D_4}\), which encode the distances from \(w_i\) to the chemical and disease entities, respectively. Both are randomly initialized. Formally, we have \(d_{w_i} = d^C_{w_i} \circ d^D_{w_i}\), where \(d_4 = d^C_4 + d^D_4\). This incorporation of distance information allows us to capture the positional relationships between the current token and the target entities in our input representation.
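
Putting the four components together, a sketch of Eq. (1) for a single token might look as follows (the embedding lookups themselves are assumed to be pre-built):

```python
import torch

def input_representation(e_w, c_w, p_w, d_chem, d_dis):
    """Eq. (1): x_i is the concatenation [e_w; c_w; p_w; d_w]."""
    d_w = torch.cat([d_chem, d_dis], dim=-1)        # distance embedding d_{w_i}
    return torch.cat([e_w, c_w, p_w, d_w], dim=-1)  # x_i in R^{d1+d2+d3+d4}
```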

3.3 LSTM Network

After the input document passes through the input representation module, we utilize the long short-term memory (LSTM) network [5] to effectively leverage the context of each token embedding \(x_t\). The LSTM network employs several controller gates to overcome the vanishing gradient problem. At time step t, it computes the current hidden state \(h_t\) and cell state \(c_t\) based on the input token embedding \(x_t\), the previous hidden state \(h_{t-1}\), and the previous cell state \(c_{t-1}\). The equations governing this computation are as follows:

$$\begin{aligned} i_t= & {} \sigma (W_i x_t + U_i h_{t-1} + b_i) \nonumber \\ f_t= & {} \sigma (W_f x_t + U_f h_{t-1} + b_f) \nonumber \\ o_t= & {} \sigma (W_o x_t + U_o h_{t-1} + b_o) \nonumber \\ c_t= & {} \tanh (W_c x_t + U_c h_{t-1} + b_c) \odot i_t + f_t \odot c_{t-1} \nonumber \\ h_t= & {} \tanh (c_t) \odot o_t, \end{aligned}$$
(2)

where \(W_{x}\), \(U_{x}\), and \(b_{x}\), \(x \in \{i,f,o,c\}\), are the model parameters.

Moreover, we use two separate LSTM networks, known as the forward LSTM and the backward LSTM, to capture context information in the left-to-right and right-to-left directions simultaneously. For each token embedding \(x_t\), we generate a final hidden state \(h_{t}\) by concatenating the forward hidden state \(h^{f}_{t}\) and the backward hidden state \(h^{b}_{t}\) as follows:

$$\begin{aligned} h^{f}_{t}= & {} {\text {LSTM}}^{f}(x_t, h^{f}_{t-1}) \nonumber \\ h^{b}_{t}= & {} {\text {LSTM}}^{b}(x_t, h^{b}_{t-1}) \nonumber \\ h_t= & {} h^{f}_{t} \circ h^{b}_{t}. \end{aligned}$$
(3)
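
In practice, Eqs. (2) and (3) correspond to a standard bidirectional LSTM layer. A minimal PyTorch sketch, using the dimensions reported in Sect. 4.2, is:

```python
import torch
import torch.nn as nn

d = 1024 + 30 + 10 + 100      # d1 + d2 + d3 + d4, per Sect. 4.2
bilstm = nn.LSTM(input_size=d, hidden_size=150,   # 150 per direction
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 20, d)     # a dummy 20-token document of Eq. (1) vectors
h, _ = bilstm(x)              # h: (1, 20, 300); h_t is the concatenation of
                              # forward and backward states (Eq. (3))
```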

3.4 State Transition Process

3.4.1 Node-Edge Representation

Let us consider each edge in our dependency graph as a tuple \((i, j, l)\), where i and j are the source and target nodes and l is the edge type label. We compute the representation of each edge \((i, j, l)\) as follows:

$$\begin{aligned} s_{i,j}^{l} = \tanh \Big ( \textbf{W}_{\text {node\_edge}} \big ( e_{l} \circ h_{i} \big ) + \textbf{b}_{\text {node\_edge}} \Big ). \end{aligned}$$
(4)

Here, \(\textbf{W}_{\text {node\_edge}}\) and \(\textbf{b}_{\text {node\_edge}}\) are the model weight matrix and bias, respectively; \(e_{l}\) is the embedding of edge type label l, and \(h_{i}\) is the LSTM's final hidden state for token i. The edge type vector \(e_{l}\) is randomly initialized and updated during training.

An individual node in the dependency graph gathers information from its parents and descendants. To create new input vectors for each node \(v_j\), we calculate two terms: the sum over its set of incoming edges \(E_{\text {in}}(j)\) and the sum over its set of outgoing edges \(E_{\text {out}}(j)\):

$$\begin{aligned} s_{j}^{\text {in}}= & {} \sum _{(i,j,l) \in E_{\text {in}}(j)} {s_{i,j}^{l}} \nonumber \\ s_{j}^{\text {out}}= & {} \sum _{(j,k,l) \in E_{\text {out}}(j)} {s_{k,j}^{l}}. \end{aligned}$$
(5)
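
The sketch below computes Eqs. (4)-(5) for one node j; `edge_emb`, `W_ne`, `b_ne`, and the `in_edges`/`out_edges` adjacency lists are illustrative names, not the paper's code:

```python
import torch

def node_edge_inputs(j, h, edge_emb, W_ne, b_ne, in_edges, out_edges):
    def s(src, l):  # Eq. (4): edge representation built from e_l and h_src
        return torch.tanh(W_ne @ torch.cat([edge_emb(l), h[src]]) + b_ne)
    zero = torch.zeros_like(b_ne)
    s_in = sum((s(i, l) for (i, _, l) in in_edges[j]), zero)    # Eq. (5), in
    s_out = sum((s(k, l) for (_, k, l) in out_edges[j]), zero)  # Eq. (5), out
    return s_in, s_out
```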

3.4.2 State Transition

For convenience, let us denote by \(r_j\) the state of node \(v_j\) in our dependency graph \(G = (V, E)\). Each \(r_j\) consists of two elements: the node hidden state \(\hat{h_j}\) and the node cell state \(\hat{c_j}\); that is, \(r_j = (\hat{h_j}, \hat{c_j} ), \forall v_j \in V\). We also denote the state of the whole graph by g, which is represented as follows:

$$\begin{aligned} \begin{aligned} g = \{ r_j \} \vert _{v_j \in V} . \end{aligned} \end{aligned}$$
(6)

Inspired by the idea proposed in Ref. [35], we adopt a recurrent approach to enrich the document-level state g. This approach generates a sequence of graph states \(g_0, g_1,\ldots , g_T\), where \(g_t = \{ r^t_j \} \vert _{v_j \in V} \). The initial graph state \(g_0\) contains the initial node states \(r_{j}^{0} = ( \hat{h}_j^{0}, \hat{c}_{j}^{0} ), \forall v_j \in V\), which are zero vectors. The number of transition steps, denoted by T, is determined through cross-validation.

Fig. 4

An illustration of the state transition process [35]

During the transition from \(g_{t-1}\) to \(g_t\), we apply an information exchange process among the nodes in the dependency graph. This process allows information to flow into a node from the neighboring nodes directly connected to it. To avoid the problems of vanishing and exploding gradients, we incorporate several controller gates inspired by the LSTM framework [5]. Figure 4 illustrates the state transition process.

Formally, to calculate the state \(r_j^{t} = (\hat{h}^{t}_{j}, \hat{c}_{j}^{t} )\) of each node \(v_j\) at time step t, we compute two additional vectors, \(\hat{h}_{j}^{\text {in}}\) and \(\hat{h}_{j}^{\text {out}}\), obtained by summing the hidden states of its incoming and outgoing neighbor nodes, respectively, from the previous time step \(t-1\):

$$\begin{aligned} \hat{h}_{j}^{\text {in}}= & {} \sum _{(i,j,l) \in E_{\text {in}}(j)} {\hat{h}_{i}^{t-1}} \nonumber \\ \hat{h}_{j}^{\text {out}}= & {} \sum _{(j,k,l) \in E_{\text {out}}(j)} {\hat{h}_{k}^{t-1}}. \end{aligned}$$
(7)

The node hidden state \(\hat{h}^{t}_{j}\) and the node cell state \(\hat{c}^{t}_{j}\) are calculated using the node-edge representations \(s^{\text {in}}_{j}\) and \(s^{\text {out}}_{j}\), as well as the incoming and outgoing hidden states \(\hat{h}_{j}^{\text {in}}\) and \(\hat{h}_{j}^{\text {out}}\):

$$\begin{aligned} i_{j}^{t}= & {} \sigma { ( W^{\text {in}}_{i} s^{\text {in}}_{j} + W^{\text {out}}_{i} s_{j}^{\text {out}} + U^{\text {in}}_{i} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{i} \hat{h}^{\text {out}}_{j} + b_i )} \nonumber \\ o_{j}^{t}= & {} \sigma {( W^{\text {in}}_{o} s^{\text {in}}_{j} + W^{\text {out}}_{o} s_{j}^{\text {out}} + U^{\text {in}}_{o} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{o} \hat{h}^{\text {out}}_{j} + b_o )} \nonumber \\ f_{j}^{t}= & {} \sigma {( W^{\text {in}}_{f} s^{\text {in}}_{j} + W^{\text {out}}_{f} s_{j}^{\text {out}} + U^{\text {in}}_{f} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{f} \hat{h}^{\text {out}}_{j} + b_f )} \nonumber \\ u_{j}^{t}= & {} \sigma {( W^{\text {in}}_{u} s^{\text {in}}_{j} + W^{\text {out}}_{u} s_{j}^{\text {out}} + U^{\text {in}}_{u} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{u} \hat{h}^{\text {out}}_{j} + b_u )} \nonumber \\ \hat{c}_{j}^{t}= & {} f_{j}^{t} \odot \hat{c}_{j}^{t-1} + i_{j}^{t} \odot u_{j}^{t} \nonumber \\ \hat{h}_{j}^{t}= & {} o_{j}^{t} \odot \tanh {\hat{c}_{j}^{t}}, \end{aligned}$$
(8)

where \(i_{j}^{t}, o_{j}^{t}, f_{j}^{t}, u_{j}^{t}\) are the input, output, forget, and update gates, respectively, and \(W_{x}^{\text {in}}, W_{x}^{\text {out}}, U_{x}^{\text {in}}, U_{x}^{\text {out}}\), and \(b_{x}\) (\(x \in \{ i,o,f,u \}\)) are the model parameters.

At the final transition step T, our model generates the graph state \(g_T\), which contains a set of rich features \(r_j^{T}\) = \((\hat{h}^{T}_{j}, \hat{c}_{j}^{T} )\), \(\forall v_j \in V\). We utilize the node hidden state \(\hat{h}^{T}_{j}\) to make predictions at the entity level.
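
For clarity, the following sketch spells out one transition step \(g_{t-1} \rightarrow g_t\) (Eqs. (7)-(8)) for a single node j; bundling the gate parameters into a `params` dictionary is our own convenience, not the paper's notation:

```python
import torch

def gate(x, s_in, s_out, h_in, h_out, params):
    W_in, W_out, U_in, U_out, b = params[x]   # parameters of gate x
    return torch.sigmoid(W_in @ s_in + W_out @ s_out +
                         U_in @ h_in + U_out @ h_out + b)

def transition_step(j, h_prev, c_prev, s_in, s_out, in_edges, out_edges, params):
    zero = torch.zeros_like(c_prev[j])
    # Eq. (7): sum the neighbors' hidden states from step t-1
    h_in = sum((h_prev[i] for (i, _, _) in in_edges[j]), zero)
    h_out = sum((h_prev[k] for (_, k, _) in out_edges[j]), zero)
    # Eq. (8): input, output, forget, and update gates, then the new state
    i_t, o_t, f_t, u_t = (gate(x, s_in, s_out, h_in, h_out, params)
                          for x in "iofu")
    c_t = f_t * c_prev[j] + i_t * u_t
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t
```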

3.4.3 Entity-Level Prediction

Since CID relations are annotated at the entity level rather than the mention level, we aggregate information from all mention pairs in the document to make the final entity-level prediction.

Following the state transition process, we obtain a final hidden vector for each mention of a chemical or disease entity. For mentions spanning multiple nodes, we use the sum of their node hidden vectors as their representations. Let us denote by \(c = \{ c_1, c_2,\ldots , c_m \}\) and \(d = \{ d_1, d_2,\ldots , d_n \}\) the sets of representations of chemical and disease entity mentions, respectively, where m and n are the numbers of mentions of each entity type. We apply a linear transformation with the tanh activation function to reduce the dimension of each chemical and disease vector.

The final representations \(c_{i}^{\text {final}}\) and \(d_{j}^{\text {final}}\) of the ith chemical mention and the jth disease mention, respectively, are calculated as follows:

$$\begin{aligned} c_i^{\text {final}}= & {} \tanh {(W_c c_i + b_c)}, \quad \forall i = 1\ldots m \nonumber \\ d_j^{\text {final}}= & {} \tanh {(W_d d_j + b_d)}, \quad \forall j = 1\ldots n, \end{aligned}$$
(9)

where \(W_c\) and \(W_d\) are the model weights, and \(b_c\) and \(b_d\) are the corresponding bias vectors for chemical and disease entities, respectively.

To calculate the prediction score for each entity mention pair, we utilize their final vectors and the relative distance between the mentions. We compute a two-dimensional vector that represents whether or not there is a CID relation between the two target entities.

Formally, the score \(a_{ij}\) is computed as follows:

$$\begin{aligned} \begin{aligned} a_{ij} = W_{\text {score}} (c^{\text {final}}_{i} \circ d^{\text {final}}_{j} \circ R_{\Vert p_{c_i} - p_{d_j}\Vert }) + b_{\text {score}}. \end{aligned} \end{aligned}$$
(10)

In the equation above, \(W_{\text {score}}\) and \(b_{\text {score}}\) are the model parameters, and \(R_{\Vert p_{c_i} - p_{d_j}\Vert }\) is the embedding of the relative distance between the two entity mentions. This embedding is randomly initialized and updated during training.

Finally, to obtain the final score for the entity-level prediction, we apply a max-pooling function over all entity mention pairs, as follows:

$$\begin{aligned} \text {final\_score}(c,d) = \max (a_{ij}), \quad \forall i = 1\ldots m, \quad j = 1\ldots n. \end{aligned}$$
(11)
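
A sketch of the whole entity-level prediction (Eqs. (9)-(11)) is given below; mention representations, positions, and all parameter names are illustrative:

```python
import torch

def entity_score(chem_reprs, dis_reprs, chem_pos, dis_pos,
                 W_c, b_c, W_d, b_d, W_score, b_score, rel_dist_emb):
    scores = []
    for ci, pc in zip(chem_reprs, chem_pos):
        c_fin = torch.tanh(W_c @ ci + b_c)                  # Eq. (9)
        for dj, pd in zip(dis_reprs, dis_pos):
            d_fin = torch.tanh(W_d @ dj + b_d)              # Eq. (9)
            r = rel_dist_emb(torch.tensor(abs(pc - pd)))    # distance embedding
            a_ij = W_score @ torch.cat([c_fin, d_fin, r]) + b_score  # Eq. (10)
            scores.append(a_ij)
    # Eq. (11): max-pool the 2-dimensional scores over all mention pairs
    return torch.stack(scores).max(dim=0).values
```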

3.4.4 Named Entity Recognition

Previous studies have demonstrated performance improvements from incorporating named entity recognition as an auxiliary task for relation extraction [2]. In this work, we also investigate the effectiveness of jointly training relation extraction and named entity recognition (NER) for enhancing CID extraction performance. We predict an entity label for each token by feeding the LSTM network's output \(h_t\) into a linear classifier:

$$\begin{aligned} \begin{aligned} l_t = W_{\text {ner}} h_t + b_{\text {ner}}, \end{aligned} \end{aligned}$$
(12)

where \(W_{\text {ner}}\) and \(b_{\text {ner}}\) are model parameters. Furthermore, we use the standard IOB format to encode entity boundaries.
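
As a sketch, the NER head of Eq. (12) is a single linear layer over the BiLSTM outputs; the IOB label set below for the two entity types is our assumption for illustration:

```python
import torch
import torch.nn as nn

IOB_LABELS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]
ner_head = nn.Linear(300, len(IOB_LABELS))  # 300 = 2 x 150 BiLSTM hidden size

h = torch.randn(1, 20, 300)   # BiLSTM outputs h_t for a dummy 20-token text
logits = ner_head(h)          # entity label scores l_t: (1, 20, 5)
```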

3.4.5 Training

We employ softmax functions to compute a probability distribution for both relation extraction and named entity recognition tasks.

For relation extraction, we apply the softmax function to the entity-level prediction score to obtain a probability distribution over the set of relation labels:

$$\begin{aligned} \textbf{P}(\textbf{r}_{c,d}) = \text {Softmax}(\text {final\_score} (c,d) ). \end{aligned}$$
(13)

To optimize the model, we minimize the negative log-likelihood of the ground-truth relation label given the input dependency graph and the model parameters \(\theta _{re}\):

$$\begin{aligned} l_{re} = - \log {p(r_{c,d} = r^{*}_{c,d}\ \vert \ G(V,E), \theta _{re})}. \end{aligned}$$
(14)

Here, \(r_{c,d}^{*}\) is the ground-truth relation between the chemical entity c and the disease entity d.

For named entity recognition, we utilize the softmax function to compute a probability distribution over the set of entity labels from the entity label score \(l_t\) of token \(w_t\):

$$\begin{aligned} \begin{aligned} {\textbf {P}}(\varvec{y}_t) = \text {Softmax}(l_t ). \end{aligned} \end{aligned}$$
(15)

To train the named entity recognition model, we minimize the negative log-likelihood of the ground-truth entity labels given the input sequence \(w_1, w_2,\ldots , w_n\) and the model parameters \(\theta _{\text {ner}}\):

$$\begin{aligned} \begin{aligned} l_{\text {ner}} = - \sum ^{n}_{t= 1} \log {p(y_t = y^{*}_{t}\ \vert \ w_{t}, \theta _{\text {ner}} )}. \end{aligned} \end{aligned}$$
(16)

Here, \(y^{*}_{t}\) denotes the ground-truth entity label for token \(w_t\).

In the multi-task setting, we jointly train the named entity recognition and relation extraction tasks, which share all embeddings and LSTM network parameters. The overall loss is computed as the weighted sum of the relation extraction loss (\(l_{re}\)) and the named entity recognition loss (\(l_{\text {ner}}\)):

$$\begin{aligned} \begin{aligned} l_{\text {total}} = \lambda _{1} l_{re} + \lambda _{2} l_{\text {ner}}. \end{aligned} \end{aligned}$$
(17)

Here, \(\lambda _{1}\) and \(\lambda _2\) are coefficients that determine the importance of each loss; they are selected by cross-validation.
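
In a typical implementation, Eqs. (13)-(17) collapse into two cross-entropy terms (softmax plus negative log-likelihood) combined with the loss weights; a sketch follows, with \(\lambda _1 = \lambda _2 = 1\) as set in Sect. 4.2:

```python
import torch.nn.functional as F

def total_loss(final_scores, rel_gold, ner_logits, ner_gold,
               lambda1=1.0, lambda2=1.0):
    # Eqs. (13)-(14): cross_entropy = softmax + negative log-likelihood
    l_re = F.cross_entropy(final_scores, rel_gold)
    # Eqs. (15)-(16): token-level NLL over the IOB entity labels
    l_ner = F.cross_entropy(ner_logits.view(-1, ner_logits.size(-1)),
                            ner_gold.view(-1))
    return lambda1 * l_re + lambda2 * l_ner   # Eq. (17)
```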

4 Model Evaluation

4.1 Dataset

We use the BioCreative V chemical–disease relation (CDR) corpus [1] for training, validating, and evaluating our model. This corpus consists of 1500 PubMed abstracts, split evenly into training, development, and test sets of 500 abstracts each. Table 1 provides an overview of the CDR corpus statistics.

Table 1 BioCreative V CDR corpus statistics

In our study, we utilize the gold entity annotations provided in the CDR corpus. The model is first trained on the training set, with its hyper-parameters tuned on the development set. Subsequently, we train the model on both the training and development sets and conduct a final evaluation on the test set. To assess the performance of our model, we employ the standard F1-score on the test set as the evaluation metric.

Table 2 The effectiveness of input representation of our model

4.2 Experimental Settings

In our experiments, we use ScispaCy [36], a complete biomedical text processing pipeline, for word tokenization, dependency parsing, and coarse-grained POS tagging. The dimensions of the POS embeddings and the edge embeddings are both set to 10. We utilize character embeddings and BioELMo embeddings with dimensions of 30 and 1024, respectively.

For the LSTM network, we set the hidden state dimension to 150. The node hidden states and node cell states are also 150-dimensional vectors. The dimension of the final representation of each entity is set to 100. We encode the relative distance between two entity mentions as a 50-dimensional vector, and the distance embedding dimension is set to 100. Furthermore, we set the number of graph transition steps T to 6.

During model training, we employ the AdamW optimizer [37] with a learning rate of 7e−4 and a weight decay of 0.01. The minibatch size is set to 8, the number of epochs to 3, and the dropout rate to 0.2. The two loss coefficients \(\lambda _{1}\) and \(\lambda _2\) are both set to 1 after cross-validation.
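
For reference, the settings above can be summarized in one configuration block; this summary is ours, not released code:

```python
CONFIG = {
    "pos_emb_dim": 10, "edge_emb_dim": 10,        # POS and edge embeddings
    "char_emb_dim": 30, "elmo_dim": 1024,         # character and BioELMo
    "lstm_hidden": 150, "entity_final_dim": 100,  # encoder and entity sizes
    "rel_dist_dim": 50, "dist_emb_dim": 100,      # distance embeddings
    "graph_steps_T": 6,                           # state transition steps
    "optimizer": "AdamW", "lr": 7e-4, "weight_decay": 0.01,
    "batch_size": 8, "epochs": 3, "dropout": 0.2,
    "lambda1": 1.0, "lambda2": 1.0,               # loss coefficients
}
```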

4.3 Experimental Results

4.3.1 Effect of Input Representation

In this experiment, we investigate the impact of incorporating additional input features alongside the ELMo word representation and assess their effectiveness. We observe that including the POS embeddings yields an improvement of 0.2% in F1 score, reaching 65.0%. This enhancement demonstrates the meaningful contribution of POS information to CID relation extraction.

Furthermore, incorporating the character embeddings further increases our model's performance from an F1 of 65.0% to 65.7%. On the other hand, when we substitute the contextual BioELMo word embeddings with the static BioWord2Vec word embeddings [38], the model's performance declines significantly, from an F1 of 65.7% down to 55.7%. This indicates that contextual word embeddings generate more informative word representations, which are more beneficial for the CID extraction task. Table 2 summarizes our proposed model's performance with different input representations.

4.3.2 Effect of State Transition Process

To evaluate the usefulness of the state transition process, we conduct an ablation experiment similar to the one on the input representation. In this experiment, we use the graph LSTM with all types of embeddings: ELMo embeddings, character embeddings, POS embeddings, and distance embeddings. However, we remove the state transition module from the graph LSTM model, so the LSTM's output is used directly for the entity-level prediction. The results of this experiment are presented in Table 3.

Table 3 The effectiveness of state transition process in our proposed model

Table 3 demonstrates the crucial role of the state transition process in our proposed model's performance, as removing it causes a significant decrease of 2.1% in F1 score. We note that including the graph state transition process does not change the size of REGREx, which has 100 million parameters.

We conducted thorough experiments to investigate whether REGREx pays a significantly higher computation cost for its superior performance when the fully connected graph LSTM is added. The results show that including the graph state transition process slows training completion by 0.5 min, from 14 to 14.5 min. Similarly, the average inference time for a test input document increases from 1.16 to 2.23 min.

In addition, we explore the capability of the state transition module to capture inter-sentence relations. For this purpose, we create a subset of the input documents in which no single sentence contains both entities of a chemical–disease mention pair. This subset serves as the input for our graph LSTM model. The performance of our proposed model in predicting inter-sentence CID relations is shown in Table 4.

Table 4 Our model’s performance in prediction of the inter-sentence relations

As demonstrated in Table 4, removing the state transition module results in a notable decrease of 1.2% in F1 for predicting inter-sentence CID relations. This underscores the critical importance of the state transition process in predicting such relations.

4.3.3 Effect of Multi-Task Learning

To investigate the efficacy of multi-task learning, we conduct two experiments as follows. In the first experiment, we use the graph LSTM for a single task only, namely CID relation extraction. In the second experiment, we employ multi-task learning to simultaneously train our model on both the relation extraction and named entity recognition tasks by optimizing the joint loss function. Table 5 shows the performance of our model in these two experiments.

Table 5 Performance of our model in the single-task and multi-task learning contexts

Incorporating multi-task learning proves beneficial for our model, leading to a notable improvement in performance. In particular, joint training with the named entity recognition task increases F1 from 66.0% to 66.8%. This improvement of 0.8% in F1 highlights the effectiveness of jointly integrating named entity recognition to enhance entity representations, thereby improving the prediction of CID relations.

4.3.4 Comparison with Recent Related Works

We compare the document-level CID relation extraction performance of our proposed model REGREx with that of nine other recent state-of-the-art models on the gold benchmark dataset, the BioCreative V CDR. These models are: Lu et al. [19], MRN (2021) [20], LSR (2020) [21], HGNN (2021) [22], DHG-BERT (2020) [23], SSAN-BERT (2021) [24], GCNN (2019) [3], GCN + multi-head Attn (2020) [16], and RC (2022) [25]. Four of these are graph-based models: Lu et al. [19], LSR, HGNN, and DHG-BERT. Table 6 presents the experimental results of all models for comparison.

Table 6 Experimental results of our proposed model REGREx and other related models on the CID extraction task

Compared against the models introduced between 2020 and 2021, our model outperforms them by large margins in F1 score, ranging from 0.9% to 2.4%. We note that GCNN [3], introduced in 2019, which utilizes a labeled-edge graph convolutional neural network, achieves the lowest F1 score (58.6%) among the nine compared models. Interestingly, our model also outperforms the recent RC model proposed in 2022, showing a 0.7% improvement in F1 score.

Of the nine compared models, RC is the only one that achieves a well-balanced precision and recall for document-level CID relation extraction, similar to our model. Although our model has the lowest recall, it exhibits the highest precision, indicating that our proposed model yields the most rigorous CID predictions. We note that when fine-tuning the prediction threshold, our model can achieve a recall of 75.0%, a precision of 63.1%, and an F1 of 68.5%. The GCN + multi-head Attn model [16] attains the highest recall at 72.7%, which is 7.5% better than ours. However, that approach integrates various predefined rules to construct training instances, which can remove a large number of noisy entity mention pairs. It is worth noting that our model performs significantly better overall, with a 3.3% higher F1 score than the GCN + multi-head Attn model.

5 Discussion and Conclusion

In this study, we propose an approach to tackle the challenge of document-level chemical-induced disease (CID) relation extraction from the biomedical literature. Our proposed model REGREx constructs a unified representation graph for each input document to capture dependency information across multiple sentences. We enrich the graph representation through a state transition process inspired by the controller gates of the LSTM network. Additionally, we incorporate state-of-the-art biomedical contextual word embeddings (BioELMo in our case) to enrich the input of the graph LSTM. Furthermore, we adopt a multi-task learning framework to jointly train relation extraction with named entity recognition.

Experimental results on the BioCreative V CDR benchmark corpus demonstrate the effectiveness and competitiveness of REGREx. In the single-task setting, our model achieves an F1 score of 66.0%, while in the multi-task setting it achieves 66.8%. With an F1 score of 66.8%, our model surpasses all nine compared recent state-of-the-art CID relation extraction models.

Labeling data for CID relation extraction is a time-consuming and labor-intensive task. For future work, we plan to enhance our model with a semi-supervised learning framework, specifically the self-training method. This approach will leverage a large amount of in-domain unlabeled data to further improve the performance of our model.