1 Introduction

The extraction of semantic relationships between chemical and disease entities has wide-ranging applications in biomedical research and healthcare, including toxicology studies, drug discovery, and drug safety surveillance [1]. To encourage active involvement from the natural language processing (NLP) community, the BioCreative V community [1] launched the chemical-induced disease (CID) relation extraction challenge.

CID relation extraction is usually formulated as a binary classification problem over recognized biomedical named entities. Given a document of multiple sentences, a list of named entity mentions, and pairs of chemical and disease concept identifiers, the objective is to determine whether a chemical–disease identifier pair expresses a CID relation. Figure 1 illustrates an example of CID relation extraction.

The CID extraction task, however, presents several challenges. First, entity relations are annotated at the concept level rather than the mention level, and an entity can be mentioned multiple times in different sentences of a document. As a result, interactions between entities often transcend sentence boundaries [2]; these are referred to as inter-sentence relations. Recognizing inter-sentence relations is considerably more intricate than identifying intra-sentence relations: to tackle the task successfully, a model requires a comprehensive understanding of cross-sentence dependencies and discourse information [3, 4].

Second, as entities can be mentioned across sentences, another challenge is to identify long-distance CID relations, in which a chemical–disease mention pair may be separated by hundreds of tokens. Conventional RNN-based models, including long short-term memory (LSTM) [5] and gated recurrent unit (GRU) [6] networks, struggle to capture information from very long word sequences [7]. To address this issue, earlier works have proposed alternative approaches that employ either convolutional neural networks [8] or graph-based neural models [3] with advanced contextual word embeddings on document-level dependency graphs.

Fig. 1

An example of CID relation extraction

In this work, we propose a graph LSTM-based neural model called REGREx for extracting CID relations at the document level. Our model leverages global dependency information across multiple sentences to improve performance. It facilitates information exchange among nodes through dependency connections, thus enhancing long-distance relation extraction. To control the information flow, we extend the LSTM network's gating mechanism to the graph. We also enhance the input representation by incorporating advanced biomedical contextual word embeddings. Finally, we jointly train relation extraction with a named entity recognition task to improve model robustness. Experimental results demonstrate superior performance compared with state-of-the-art methods for CID relation extraction.

2 Related Work

Deep learning models have emerged as dominant methods for various NLP tasks. Convolutional neural networks (CNN) [9, 10] and long short-term memory (LSTM) networks [5] have been utilized for CID relation extraction. [11] achieved good performance by employing both LSTM and CNN models on traditional word embeddings [12] together with additional linguistic features. [13] introduced a CNN model that extracts syntactic features from local shortest dependency paths (SDPs) to predict intra-sentence CID relations. [8] enhanced CNN with character-based word embeddings for CID relation extraction. The multi-head self-attention mechanism proposed by Vaswani et al. [14], which enables models to simultaneously capture important and complex long-range dependencies from diverse contextual perspectives of the input sequence, has significantly advanced downstream NLP tasks, including CID extraction. Inspired by this concept, Sahu et al. [2] introduced a bi-affine multi-head self-attention model specifically for CID extraction, yielding performance on par with other well-established CNN- and LSTM-based methods.

Recent advanced approaches in the field have endeavored to overcome the inherent limitations of intra-sentence linguistic features. They often involve constructing various unified graphs as representations of each document, upon which a diverse range of graph-based neural approaches have been proposed. Notably, Sahu et al. [3] introduced a graph convolutional network (GCN) model [15] on a document-level graph whose edges encode both dependencies and co-references. This work demonstrated great promise in effectively extracting inter-sentence semantic relations. In another study, Wang et al. [16] combined a GCN with a multi-head self-attention mechanism on the document-level dependency graph. Additionally, some studies integrate graphs at other levels, such as GRACR by Liu et al. [17] with an entity-level graph and MHGNN by Wang et al. [18] with three graphs: a word-level graph, a mention-level graph, and an entity-level graph.

Lu et al. [19] constructed a hybrid graph that merges syntactic and abstract meaning representation graphs, along with hierarchical concentrative attention, effectively capturing and prioritizing important long-distance information for CID relation prediction. Li et al. [20] presented MRN, a mention-based reasoning network that incorporates both local and global reasoning along with a co-predictor module to predict CID relations. Nan et al. [21] introduced LSR, a latent structure refinement model that dynamically learns the document graph. LSR performs end-to-end predictions without relying on syntactic trees or heuristics, enhancing the extraction of CID relations. Shi et al. [22] proposed HGNN, a method leveraging heterogeneous graph neural networks. HGNN utilizes temporal convolutional networks and graph transformer networks to capture long-distance dependencies and enhance potential interactions between entities. Zhang et al. [23] presented DHG, a model based on a dual-tier heterogeneous graph. DHG incorporates a structure modeling layer and a relation reasoning layer for multi-hop reasoning and decision-making. Xu et al. [24] introduced SSAN, a structural self-attention network model. SSAN integrates unique dependencies between mention pairs using self-attention, enhancing the overall encoding process.

Very recently, an innovative approach to document-level semantic relation extraction has been proposed that transforms the extraction task into a question answering task. Chen et al. [25] introduced RC, a model that follows this approach: after the transformation, RC leverages reading comprehension and prior knowledge to improve the document-level extraction process.

3 Method

In this section, we describe the details of our proposed model, which consists of six modules. First, we describe how the document-level dependency graph is constructed to represent the entire input document. Second, we describe how each word in the original document is converted into a vector, which serves as the input representation for the document-level dependency graph. The third sub-section focuses on the traditional LSTM network architecture, which gathers and enriches contextual information for each word in the paragraph. The fourth sub-section explains how the graph's state transition occurs. The fifth sub-section details how the model makes entity-level predictions for CID relations. The following sub-section outlines the named entity recognition task. Finally, we provide the details of how the model is trained.

3.1 Document-Level Dependency Graph

The document-level dependency graph is the core component of our proposed method. To construct this unified graph, we generate a dependency tree for every sentence in the input document. Within a dependency tree, each node is connected to its parent or descendants through syntactic dependencies (e.g., nsubj, case, and det). To create the cross-sentence dependency graph for the document, we connect the roots of the dependency trees of two consecutive sentences; this new edge type is referred to as next-sent. Additionally, we enrich the document graph by introducing another connection, called next-node, which links two adjacent nodes. Finally, we allow each node to connect to itself via a special edge that we denote as self-edge. Figure 2 depicts an example of our document-level dependency graph.

Formally, let us consider our document dependency graph \(\textbf{G} = ( \textbf{V}, \textbf{E} )\), where \(\textbf{V}\) and \(\textbf{E}\) represent the sets of all nodes and edges, respectively. \(\textbf{V}\) contains all word tokens in the document, whereas \(\textbf{E}\) comprises edges belonging to one of the four types below (a construction sketch follows the list):

  • Syntactic dependency edge: represents the syntactic relation between two nodes in a dependency tree.

  • Next-sent edge: connects the roots of two consecutive sentences to capture cross-sentence dependencies.

  • Next-node edge: connects adjacent nodes within the same sentence to capture local dependencies.

  • Self-edge: allows a node to link to itself.
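
To make the construction concrete, the following Python sketch assembles the four edge types from pre-parsed sentences. The input format, (token index, head index, dependency label) triples with document-wide token indices, is our own assumption for illustration; the paper obtains parses with ScispaCy (Sect. 4.2).

```python
# A minimal sketch of the document-level graph construction, assuming each
# sentence is pre-parsed into (token_idx, head_idx, dep_label) triples with
# document-wide token indices and head_idx == -1 marking the sentence root.

def build_document_graph(sentences):
    edges, prev_root = [], None
    for sent in sentences:
        root = None
        for tok_idx, head_idx, dep_label in sent:
            if head_idx == -1:
                root = tok_idx
            else:
                edges.append((head_idx, tok_idx, dep_label))  # syntactic edge
            edges.append((tok_idx, tok_idx, "self-edge"))     # self-edge
        # next-node edges between adjacent tokens of the same sentence
        idxs = [t for t, _, _ in sent]
        edges.extend((a, b, "next-node") for a, b in zip(idxs, idxs[1:]))
        # next-sent edge between the roots of consecutive sentences
        if prev_root is not None and root is not None:
            edges.append((prev_root, root, "next-sent"))
        prev_root = root
    return edges
```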

Fig. 2

An example of our dependency graph. For simplicity, we only consider two consecutive sentences and omit all next-node edges and self-edges

3.2 Input Representation

In this section, we present how we construct the input word vectors for our document-level graph LSTM. Let \(x_{i} \in \mathbb {R}^{d}\) denote the embedding representation of the ith token in the input sequence \(w_1, w_2,\ldots , w_n\). We build \(x_{i}\) by combining four types of embeddings: a contextual word embedding \(e_{w_i} \in \mathbb {R}^{d_1}\), a character embedding \(c_{w_i} \in \mathbb {R}^{d_2}\), a part-of-speech (POS) embedding \(p_{w_i} \in \mathbb {R}^{d_3}\), and a distance embedding \(d_{w_i} \in \mathbb {R}^{d_4}\). As a result, \(d = d_1 + d_2 + d_3 + d_4\), and

$$\begin{aligned} x_{i} = e_{w_i} \circ c_{w_i} \circ p_{w_i} \circ d_{w_i}. \end{aligned}$$
(1)

Here, \(\circ \) denotes the concatenation operation.

Contextual Word Embedding: To generate the contextual representation of each token in the input document, we utilize a biomedical version of ELMo [26] that has been pre-trained on 10 million PubMed abstracts comprising a total of 2.46B tokens. The word vector \(e_{w_i}\) is a dense representation in a \(d_1\)-dimensional space. In recent years, contextual word embeddings, such as ELMo [27], Flair [28, 29], BERT [30], and auto-regressive language models [31], have exhibited significant performance improvements on various natural language processing (NLP) tasks, including text classification [32], question answering [30], and named entity recognition [28]. Integrating these powerful biomedical context-sensitive word embeddings with deep neural networks has the potential to enhance the performance of the CID extraction task.

Fig. 3

An illustration of the convolutional neural network (CNN)-based character embeddings [33]

Character Embedding: Previous studies have demonstrated that character-based word embeddings enable models to capture unknown words and word morphology features [33, 34]. In the biomedical domain, we often encounter complex terminologies, such as chemical, protein, or gene names, that exhibit rich morphological structures. Following the approach proposed by [8], we apply a simple CNN layer with \(d_2\) filters to a sequence of character embeddings, each of dimension \(d_5\). To obtain the character-based representation \(c_{w_i}\), we apply a max-pooling layer after this CNN layer to capture the most salient features. Figure 3 depicts our CNN-based character embeddings.
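
As a concrete illustration, the sketch below implements such a character encoder in PyTorch. The filter count follows the character embedding dimension \(d_2 = 30\) reported in Sect. 4.2; the per-character embedding size \(d_5\) and the convolution window are our assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Sketch of the CNN-based character encoder (Fig. 3)."""
    def __init__(self, n_chars, d5=25, d2=30, kernel_size=3):  # d5, window assumed
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d5, padding_idx=0)
        self.conv = nn.Conv1d(d5, d2, kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids):           # char_ids: (n_words, max_word_len)
        e = self.char_emb(char_ids)        # (n_words, max_word_len, d5)
        e = self.conv(e.transpose(1, 2))   # (n_words, d2, max_word_len)
        return e.max(dim=2).values         # max-pool over positions -> (n_words, d2)
```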

POS Embedding: In addition to the word embedding and the character embedding, we also embed the Part-of-Speech (POS) information into the input representation. The POS embedding \(p_{w_i}\) is randomly initialized as a \(d_3\)-dimensional vector.

Distance Embedding: We enrich the input representation by incorporating the absolute distances (in terms of tokens) from the current token \(w_i\) to the two target entities. The distance embedding \(d_{w_i}\) consists of two sub-vectors, \(d^C_{w_i} \in \mathbb {R}^{d^C_4}\) and \(d^D_{w_i} \in \mathbb {R}^{d^D_4}\), which encode the distances from \(w_i\) to the chemical and disease entities, respectively. Both are randomly initialized. Formally, we have \(d_{w_i} = d^C_{w_i} \circ d^D_{w_i}\), where \(d_4 = d^C_4 + d^D_4\). This incorporation of distance information allows us to capture the positional relationships between the current token and the target entities in our input representation.
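
Putting the four components together, a sketch of Eq. (1) for a single token might look as follows (the embedding lookups themselves are assumed to be pre-built):

```python
import torch

def input_representation(e_w, c_w, p_w, d_chem, d_dis):
    """Eq. (1): x_i is the concatenation [e_w; c_w; p_w; d_w]."""
    d_w = torch.cat([d_chem, d_dis], dim=-1)        # distance embedding d_{w_i}
    return torch.cat([e_w, c_w, p_w, d_w], dim=-1)  # x_i in R^{d1+d2+d3+d4}
```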

3.3 LSTM Network

After the input document passes through the input representation module, we utilize the long short-term memory (LSTM) network [5] to effectively leverage the context of each token embedding \(x_t\). The LSTM network employs several controller gates to overcome the vanishing gradient problem. At time step t, it computes the current hidden state \(h_t\) and cell state \(c_t\) based on the input token embedding \(x_t\), the previous hidden state \(h_{t-1}\), and the previous cell state \(c_{t-1}\). The equations governing this computation are as follows:

$$\begin{aligned} i_t= & {} \sigma (W_i x_t + U_i h_{t-1} + b_i) \nonumber \\ f_t= & {} \sigma (W_f x_t + U_f h_{t-1} + b_f) \nonumber \\ o_t= & {} \sigma (W_o x_t + U_o h_{t-1} + b_o) \nonumber \\ c_t= & {} \tanh (W_c x_t + U_c h_{t-1} + b_c) \odot i_t + f_t \odot c_{t-1} \nonumber \\ h_t= & {} \tanh (c_t) \odot o_t, \end{aligned}$$
(2)

where \(W_{x}\), \(U_{x}\), and \(b_{x}\), \(x \in \{i,f,o,c\}\), are the model parameters.

Moreover, we use two separate LSTM networks, known as the forward LSTM and the backward LSTM, to capture context information in the left-to-right and right-to-left directions simultaneously. For each token embedding \(x_t\), we generate a final hidden state \(h_{t}\) by concatenating the forward hidden state \(h^{f}_{t}\) and the backward hidden state \(h^{b}_{t}\) as follows:

$$\begin{aligned} h^{f}_{t}= & {} {\text {LSTM}}^{f}(x_t, h^{f}_{t-1}) \nonumber \\ h^{b}_{t}= & {} {\text {LSTM}}^{b}(x_t, h^{b}_{t-1}) \nonumber \\ h_t= & {} h^{f}_{t} \circ h^{b}_{t}. \end{aligned}$$
(3)
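
In practice, Eqs. (2) and (3) correspond to a standard bidirectional LSTM layer. A minimal PyTorch sketch, using the dimensions reported in Sect. 4.2, is:

```python
import torch
import torch.nn as nn

d = 1024 + 30 + 10 + 100      # d1 + d2 + d3 + d4, per Sect. 4.2
bilstm = nn.LSTM(input_size=d, hidden_size=150,   # 150 per direction
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 20, d)     # a dummy 20-token document of Eq. (1) vectors
h, _ = bilstm(x)              # h: (1, 20, 300); h_t is the concatenation of
                              # forward and backward states (Eq. (3))
```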

3.4 State Transition Process

3.4.1 Node-Edge Representation

Let us consider each edge in our dependency graph as a tuple \((i, j, l)\), where i and j are the source and target nodes and l is the edge type label. We compute the representation of each edge \((i, j, l)\) as follows:

$$\begin{aligned} s_{i,j}^{l} = \tanh \Big ( \textbf{W}_{\text {node\_edge}} \big ( e_{l} \circ h_{i} \big ) + \textbf{b}_{\text {node\_edge}} \Big ). \end{aligned}$$
(4)

Here, \(\textbf{W}_{\text {node\_edge}}\) and \(\textbf{b}_{\text {node\_edge}}\) are the model weight matrix and bias, respectively; \(e_{l}\) is the embedding of edge type label l, and \(h_{i}\) is the LSTM's final hidden state for token i. The edge type vector \(e_{l}\) is randomly initialized and updated during training.

An individual node in the dependency graph gathers information from its parents and descendants. To create new input vectors for each node \(v_j\), we calculate two terms: the sum over its set of incoming edges \(E_{\text {in}}(j)\) and the sum over its set of outgoing edges \(E_{\text {out}}(j)\):

$$\begin{aligned} s_{j}^{\text {in}}= & {} \sum _{(i,j,l) \in E_{\text {in}}(j)} {s_{i,j}^{l}} \nonumber \\ s_{j}^{\text {out}}= & {} \sum _{(j,k,l) \in E_{\text {out}}(j)} {s_{k,j}^{l}}. \end{aligned}$$
(5)
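
The sketch below computes Eqs. (4)-(5) for one node j; `edge_emb`, `W_ne`, `b_ne`, and the `in_edges`/`out_edges` adjacency lists are illustrative names, not the paper's code:

```python
import torch

def node_edge_inputs(j, h, edge_emb, W_ne, b_ne, in_edges, out_edges):
    def s(src, l):  # Eq. (4): edge representation built from e_l and h_src
        return torch.tanh(W_ne @ torch.cat([edge_emb(l), h[src]]) + b_ne)
    zero = torch.zeros_like(b_ne)
    s_in = sum((s(i, l) for (i, _, l) in in_edges[j]), zero)    # Eq. (5), in
    s_out = sum((s(k, l) for (_, k, l) in out_edges[j]), zero)  # Eq. (5), out
    return s_in, s_out
```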

3.4.2 State Transition

For convenience, let us denote by \(r_j\) the state of node \(v_j\) in our dependency graph \(G = (V, E)\). Each \(r_j\) consists of two elements: the node hidden state \(\hat{h_j}\) and the node cell state \(\hat{c_j}\); that is, \(r_j = (\hat{h_j}, \hat{c_j} ), \forall v_j \in V\). We also denote the state of the whole graph by g, which is represented as follows:

$$\begin{aligned} \begin{aligned} g = \{ r_j \} \vert _{v_j \in V} . \end{aligned} \end{aligned}$$
(6)

Inspired by the idea proposed in Ref. [35], we adopt a recurrent approach to enrich the document-level state g. This approach generates a sequence of graph states \(g_0, g_1,\ldots , g_T\), where \(g_t = \{ r^t_j \} \vert _{v_j \in V} \). The initial graph state \(g_0\) contains the initial node states \(r_{j}^{0} = ( \hat{h}_j^{0}, \hat{c}_{j}^{0} ), \forall v_j \in V\), which are zero vectors. The number of transition steps, denoted by T, is determined through cross-validation.

Fig. 4

An illustration of the state transition process [35]

During the transition from \(g_{t-1}\) to \(g_t\), we apply an information exchange process among the nodes in the dependency graph. This process allows information to flow into a node from the neighboring nodes directly connected to it. To avoid the problems of vanishing and exploding gradients, we incorporate several controller gates inspired by the LSTM framework [5]. Figure 4 illustrates the state transition process.

Formally, to calculate the state \(r_j^{t} = (\hat{h}^{t}_{j}, \hat{c}_{j}^{t} )\) of each node \(v_j\) at time step t, we compute two additional vectors, \(\hat{h}_{j}^{\text {in}}\) and \(\hat{h}_{j}^{\text {out}}\), obtained by summing the hidden states of its incoming and outgoing neighbor nodes, respectively, from the previous time step \(t-1\):

$$\begin{aligned} \hat{h}_{j}^{\text {in}}= & {} \sum _{(i,j,l) \in E_{\text {in}}(j)} {\hat{h}_{i}^{t-1}} \nonumber \\ \hat{h}_{j}^{\text {out}}= & {} \sum _{(j,k,l) \in E_{\text {out}}(j)} {\hat{h}_{k}^{t-1}}. \end{aligned}$$
(7)

The node hidden state \(\hat{h}^{t}_{j}\) and the node cell state \(\hat{c}^{t}_{j}\) are calculated using the node-edge representations \(s^{\text {in}}_{j}\) and \(s^{\text {out}}_{j}\), as well as the incoming and outgoing hidden states \(\hat{h}_{j}^{\text {in}}\) and \(\hat{h}_{j}^{\text {out}}\):

$$\begin{aligned} i_{j}^{t}= & {} \sigma { ( W^{\text {in}}_{i} s^{\text {in}}_{j} + W^{\text {out}}_{i} s_{j}^{\text {out}} + U^{\text {in}}_{i} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{i} \hat{h}^{\text {out}}_{j} + b_i )} \nonumber \\ o_{j}^{t}= & {} \sigma {( W^{\text {in}}_{o} s^{\text {in}}_{j} + W^{\text {out}}_{o} s_{j}^{\text {out}} + U^{\text {in}}_{o} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{o} \hat{h}^{\text {out}}_{j} + b_o )} \nonumber \\ f_{j}^{t}= & {} \sigma {( W^{\text {in}}_{f} s^{\text {in}}_{j} + W^{\text {out}}_{f} s_{j}^{\text {out}} + U^{\text {in}}_{f} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{f} \hat{h}^{\text {out}}_{j} + b_f )} \nonumber \\ u_{j}^{t}= & {} \sigma {( W^{\text {in}}_{u} s^{\text {in}}_{j} + W^{\text {out}}_{u} s_{j}^{\text {out}} + U^{\text {in}}_{u} \hat{h}^{\text {in}}_{j} + U^{\text {out}}_{u} \hat{h}^{\text {out}}_{j} + b_u )} \nonumber \\ \hat{c}_{j}^{t}= & {} f_{j}^{t} \odot \hat{c}_{j}^{t-1} + i_{j}^{t} \odot u_{j}^{t} \nonumber \\ \hat{h}_{j}^{t}= & {} o_{j}^{t} \odot \tanh {\hat{c}_{j}^{t}}, \end{aligned}$$
(8)

where \(i_{j}^{t}, o_{j}^{t}, f_{j}^{t}, u_{j}^{t}\) are the input, output, forget, and update gates, respectively, and \(W_{x}^{\text {in}}, W_{x}^{\text {out}}, U_{x}^{\text {in}}, U_{x}^{\text {out}}\), and \(b_{x}\) (\(x \in \{ i,o,f,u \}\)) are the model parameters.

At the final transition step T, our model generates the graph state \(g_T\), which contains a set of rich features \(r_j^{T}\) = \((\hat{h}^{T}_{j}, \hat{c}_{j}^{T} )\), \(\forall v_j \in V\). We utilize the node hidden state \(\hat{h}^{T}_{j}\) to make predictions at the entity level.
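
For clarity, the following sketch spells out one transition step \(g_{t-1} \rightarrow g_t\) (Eqs. (7)-(8)) for a single node j; bundling the gate parameters into a `params` dictionary is our own convenience, not the paper's notation:

```python
import torch

def gate(x, s_in, s_out, h_in, h_out, params):
    W_in, W_out, U_in, U_out, b = params[x]   # parameters of gate x
    return torch.sigmoid(W_in @ s_in + W_out @ s_out +
                         U_in @ h_in + U_out @ h_out + b)

def transition_step(j, h_prev, c_prev, s_in, s_out, in_edges, out_edges, params):
    zero = torch.zeros_like(c_prev[j])
    # Eq. (7): sum the neighbors' hidden states from step t-1
    h_in = sum((h_prev[i] for (i, _, _) in in_edges[j]), zero)
    h_out = sum((h_prev[k] for (_, k, _) in out_edges[j]), zero)
    # Eq. (8): input, output, forget, and update gates, then the new state
    i_t, o_t, f_t, u_t = (gate(x, s_in, s_out, h_in, h_out, params)
                          for x in "iofu")
    c_t = f_t * c_prev[j] + i_t * u_t
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t
```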

3.4.3 Entity-Level Prediction

Since CID relations are annotated at the entity level rather than the mention level, we aggregate information from all mention pairs in the document to make the final entity-level prediction.

Following the state transition process, we obtain a final hidden vector for each mention of a chemical or disease entity. For mentions spanning multiple nodes, we use the sum of their node hidden vectors as their representations. Let us denote by \(c = \{ c_1, c_2,\ldots , c_m \}\) and \(d = \{ d_1, d_2,\ldots , d_n \}\) the sets of representations of chemical and disease entity mentions, respectively, where m and n are the numbers of mentions of each entity type. We apply a linear transformation with the tanh activation function to reduce the dimension of each chemical and disease vector.

The final representations \(c_{i}^{\text {final}}\) and \(d_{j}^{\text {final}}\) of the ith chemical mention and the jth disease mention, respectively, are calculated as follows:

$$\begin{aligned} c_i^{\text {final}}= & {} \tanh {(W_c c_i + b_c)}, \quad \forall i = 1\ldots m \nonumber \\ d_j^{\text {final}}= & {} \tanh {(W_d d_j + b_d)}, \quad \forall j = 1\ldots n, \end{aligned}$$
(9)

where \(W_c\) and \(W_d\) are the model weights, and \(b_c\) and \(b_d\) are the corresponding bias vectors for chemical and disease entities, respectively.

To calculate the prediction score for each entity mention pair, we utilize their final vectors and the relative distance between the mentions. We compute a two-dimensional vector that represents whether or not there is a CID relation between the two target entities.

Formally, the score \(a_{ij}\) is computed as follows:

$$\begin{aligned} \begin{aligned} a_{ij} = W_{\text {score}} (c^{\text {final}}_{i} \circ d^{\text {final}}_{j} \circ R_{\Vert p_{c_i} - p_{d_j}\Vert }) + b_{\text {score}}. \end{aligned} \end{aligned}$$
(10)

In the equation above, \(W_{\text {score}}\) and \(b_{\text {score}}\) are the model parameters, and \(R_{\Vert p_{c_i} - p_{d_j}\Vert }\) is the embedding of the relative distance between the two entity mentions. This embedding is randomly initialized and updated during training.

Finally, to obtain the final score for the entity-level prediction, we apply a max-pooling function over all entity mention pairs, as follows:

$$\begin{aligned} \text {final\_score}(c,d) = \max (a_{ij}), \quad \forall i = 1\ldots m, \quad j = 1\ldots n. \end{aligned}$$
(11)
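
A sketch of the whole entity-level prediction (Eqs. (9)-(11)) is given below; mention representations, positions, and all parameter names are illustrative:

```python
import torch

def entity_score(chem_reprs, dis_reprs, chem_pos, dis_pos,
                 W_c, b_c, W_d, b_d, W_score, b_score, rel_dist_emb):
    scores = []
    for ci, pc in zip(chem_reprs, chem_pos):
        c_fin = torch.tanh(W_c @ ci + b_c)                  # Eq. (9)
        for dj, pd in zip(dis_reprs, dis_pos):
            d_fin = torch.tanh(W_d @ dj + b_d)              # Eq. (9)
            r = rel_dist_emb(torch.tensor(abs(pc - pd)))    # distance embedding
            a_ij = W_score @ torch.cat([c_fin, d_fin, r]) + b_score  # Eq. (10)
            scores.append(a_ij)
    # Eq. (11): max-pool the 2-dimensional scores over all mention pairs
    return torch.stack(scores).max(dim=0).values
```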

3.4.4 Named Entity Recognition

Previous studies have demonstrated performance improvements from incorporating named entity recognition as an auxiliary task for relation extraction [2]. In this work, we also investigate the effectiveness of jointly training relation extraction and named entity recognition (NER) for enhancing CID extraction performance. We predict an entity label for each token by feeding the LSTM network's output \(h_t\) into a linear classifier:

$$\begin{aligned} \begin{aligned} l_t = W_{\text {ner}} h_t + b_{\text {ner}}, \end{aligned} \end{aligned}$$
(12)

where \(W_{\text {ner}}\) and \(b_{\text {ner}}\) are model parameters. Furthermore, we use the standard IOB format to encode entity boundaries.
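
As a sketch, the NER head of Eq. (12) is a single linear layer over the BiLSTM outputs; the IOB label set below for the two entity types is our assumption for illustration:

```python
import torch
import torch.nn as nn

IOB_LABELS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]
ner_head = nn.Linear(300, len(IOB_LABELS))  # 300 = 2 x 150 BiLSTM hidden size

h = torch.randn(1, 20, 300)   # BiLSTM outputs h_t for a dummy 20-token text
logits = ner_head(h)          # entity label scores l_t: (1, 20, 5)
```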

3.4.5 Training

We employ softmax functions to compute a probability distribution for both relation extraction and named entity recognition tasks.

For relation extraction, we apply the softmax function to the entity-level prediction score to obtain a probability distribution over the set of relation labels:

$$\begin{aligned} \textbf{P}(\textbf{r}_{c,d}) = \text {Softmax}(\text {final\_score} (c,d) ). \end{aligned}$$
(13)

To optimize the model, we minimize the negative log-likelihood of the ground-truth relation label given the input dependency graph and the model parameters \(\theta _{re}\):

$$\begin{aligned} l_{re} = - \log {p(r_{c,d} = r^{*}_{c,d}\ \vert \ G(V,E), \theta _{re})}. \end{aligned}$$
(14)

Here, \(r_{c,d}^{*}\) is the ground-truth relation between the chemical entity c and the disease entity d.

For named entity recognition, we utilize the softmax function to compute a probability distribution over the set of entity labels from the entity label score \(l_t\) of token \(w_t\):

$$\begin{aligned} \begin{aligned} {\textbf {P}}(\varvec{y}_t) = \text {Softmax}(l_t ). \end{aligned} \end{aligned}$$
(15)

To train the named entity recognition model, we minimize the negative log-likelihood of the ground-truth entity labels given the input sequence \(w_1, w_2,\ldots , w_n\) and the model parameters \(\theta _{\text {ner}}\):

$$\begin{aligned} \begin{aligned} l_{\text {ner}} = - \sum ^{n}_{t= 1} \log {p(y_t = y^{*}_{t}\ \vert \ w_{t}, \theta _{\text {ner}} )}. \end{aligned} \end{aligned}$$
(16)

Here, \(y^{*}_{t}\) denotes the ground-truth entity label for token \(w_t\).

In the multi-task setting, we jointly train the named entity recognition and relation extraction tasks, which share all embeddings and LSTM network parameters. The overall loss is computed as the weighted sum of the relation extraction loss (\(l_{re}\)) and the named entity recognition loss (\(l_{\text {ner}}\)):

$$\begin{aligned} \begin{aligned} l_{\text {total}} = \lambda _{1} l_{re} + \lambda _{2} l_{\text {ner}}. \end{aligned} \end{aligned}$$
(17)

Here, \(\lambda _{1}\) and \(\lambda _2\) are coefficients that determine the importance of each loss; they are selected by cross-validation.
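
In a typical implementation, Eqs. (13)-(17) collapse into two cross-entropy terms (softmax plus negative log-likelihood) combined with the loss weights; a sketch follows, with \(\lambda _1 = \lambda _2 = 1\) as set in Sect. 4.2:

```python
import torch.nn.functional as F

def total_loss(final_scores, rel_gold, ner_logits, ner_gold,
               lambda1=1.0, lambda2=1.0):
    # Eqs. (13)-(14): cross_entropy = softmax + negative log-likelihood
    l_re = F.cross_entropy(final_scores, rel_gold)
    # Eqs. (15)-(16): token-level NLL over the IOB entity labels
    l_ner = F.cross_entropy(ner_logits.view(-1, ner_logits.size(-1)),
                            ner_gold.view(-1))
    return lambda1 * l_re + lambda2 * l_ner   # Eq. (17)
```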

4 Model Evaluation

4.1 Dataset

We use the BioCreative V chemical–disease relation (CDR) corpus [1] for training, validating, and evaluating our model. This corpus consists of 1500 PubMed abstracts, split evenly into training, development, and test sets of 500 abstracts each. Table 1 provides an overview of the CDR corpus statistics.

Table 1 BioCreative V CDR corpus statistics

In our study, we utilize the gold entity annotations provided in the CDR corpus. The model is first trained on the training set, with its hyper-parameters tuned on the development set. Subsequently, we train the model on both the training and development sets and conduct a final evaluation on the test set. To assess the performance of our model, we employ the standard F1-score on the test set as the evaluation metric.

Table 2 The effectiveness of input representation of our model

4.2 Experimental Settings

In our experiments, we use ScispaCy [36], a complete biomedical text processing pipeline, for word tokenization, dependency parsing, and coarse-grained POS tagging. The dimensions of the POS embeddings and the edge embeddings are both set to 10. We utilize character embeddings and BioELMo embeddings with dimensions of 30 and 1024, respectively.

For the LSTM network, we set the hidden state dimension to 150. The node hidden states and node cell states are also 150-dimensional vectors. The dimension of the final representation of each entity is set to 100. We encode the relative distance between two entity mentions as a 50-dimensional vector, and the distance embedding dimension is set to 100. Furthermore, we set the number of graph transition steps T to 6.

During model training, we employ the AdamW optimizer [37] with a learning rate of 7e−4 and a weight decay of 0.01. The minibatch size is set to 8, the number of epochs to 3, and the dropout rate to 0.2. The two loss coefficients \(\lambda _{1}\) and \(\lambda _2\) are both set to 1 after cross-validation.
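
For reference, the settings above can be summarized in one configuration block; this summary is ours, not released code:

```python
CONFIG = {
    "pos_emb_dim": 10, "edge_emb_dim": 10,        # POS and edge embeddings
    "char_emb_dim": 30, "elmo_dim": 1024,         # character and BioELMo
    "lstm_hidden": 150, "entity_final_dim": 100,  # encoder and entity sizes
    "rel_dist_dim": 50, "dist_emb_dim": 100,      # distance embeddings
    "graph_steps_T": 6,                           # state transition steps
    "optimizer": "AdamW", "lr": 7e-4, "weight_decay": 0.01,
    "batch_size": 8, "epochs": 3, "dropout": 0.2,
    "lambda1": 1.0, "lambda2": 1.0,               # loss coefficients
}
```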

4.3 Experimental Results

4.3.1 Effect of Input Representation

In this experiment, we investigate the impact of incorporating additional input features alongside the ELMo word representation and assess their effectiveness. We observe that including the POS embeddings yields an improvement of 0.2% in F1 score, reaching 65.0%. This enhancement demonstrates the meaningful contribution of POS information to CID relation extraction.

Furthermore, incorporating the character embeddings further increases our model's performance from an F1 of 65.0% to 65.7%. On the other hand, when we substitute the contextual BioELMo word embeddings with the static BioWord2Vec word embeddings [38], the model's performance declines significantly, from an F1 of 65.7% down to 55.7%. This indicates that contextual word embeddings generate more informative word representations, which are more beneficial for the CID extraction task. Table 2 summarizes our proposed model's performance with different input representations.

4.3.2 Effect of State Transition Process

To evaluate the usefulness of the state transition process, we conduct an ablation experiment similar to the one on the input representation. In this experiment, we use the graph LSTM with all types of embeddings: ELMo embeddings, character embeddings, POS embeddings, and distance embeddings. However, we remove the state transition module from the graph LSTM model, so the LSTM's output is used directly for the entity-level prediction. The results of this experiment are presented in Table 3.

Table 3 The effectiveness of state transition process in our proposed model

Table 3 demonstrates the crucial role of the state transition process in our proposed model's performance, as removing it causes a significant decrease of 2.1% in F1 score. We note that including the graph state transition process does not change the size of REGREx, which has 100 million parameters.

We conducted thorough experiments to investigate whether REGREx pays a significantly higher computation cost for its superior performance when the fully connected graph LSTM is added. The results show that including the graph state transition process slows training completion by 0.5 min, from 14 to 14.5 min. Similarly, the average inference time for a test input document increases from 1.16 to 2.23 min.

In addition, we explore the capability of the state transition module to capture inter-sentence relations. For this purpose, we create a subset of the input documents in which no single sentence contains both entities of a chemical–disease mention pair. This subset serves as the input for our graph LSTM model. The performance of our proposed model in predicting inter-sentence CID relations is shown in Table 4.

Table 4 Our model’s performance in prediction of the inter-sentence relations

As demonstrated in Table 4, removing the state transition module results in a notable decrease of 1.2% in F1 for predicting inter-sentence CID relations. This underscores the critical importance of the state transition process in predicting such relations.

4.3.3 Effect of Multi-Task Learning

To investigate the efficacy of multi-task learning, we conduct two experiments as follows. In the first experiment, we use the graph LSTM for a single task only, namely CID relation extraction. In the second experiment, we employ multi-task learning to simultaneously train our model on both the relation extraction and named entity recognition tasks by optimizing the joint loss function. Table 5 shows the performance of our model in these two experiments.

Table 5 Performance of our model in the single-task and multi-task learning contexts

Incorporating multi-task learning proves beneficial for our model, leading to a notable improvement in performance. In particular, joint training with the named entity recognition task increases F1 from 66.0% to 66.8%. This improvement of 0.8% in F1 highlights the effectiveness of jointly integrating named entity recognition to enhance entity representations, thereby improving the prediction of CID relations.

4.3.4 Comparison with Recent Related Works

We compare the document-level CID relation extraction performance of our proposed model REGREx with that of nine other recent state-of-the-art models on the gold benchmark dataset, the BioCreative V CDR. These models are: Lu et al. [19], MRN (2021) [20], LSR (2020) [21], HGNN (2021) [22], DHG-BERT (2020) [23], SSAN-BERT (2021) [24], GCNN (2019) [3], GCN + multi-head Attn (2020) [16], and RC (2022) [25]. Four of these are graph-based models: Lu et al. [19], LSR, HGNN, and DHG-BERT. Table 6 presents the experimental results of all models for comparison.

Table 6 Experimental results of our proposed model REGREx and other related models on the CID extraction task

Compared against the models introduced between 2020 and 2021, our model outperforms them by large margins in F1 score, ranging from 0.9% to 2.4%. We note that GCNN [3], introduced in 2019, which utilizes a labeled-edge graph convolutional neural network, achieves the lowest F1 score (58.6%) among the nine compared models. Interestingly, our model also outperforms the recent RC model proposed in 2022, showing a 0.7% improvement in F1 score.

Of the nine compared models, RC is the only one that achieves a well-balanced precision and recall for document-level CID relation extraction, similar to our model. Although our model has the lowest recall, it exhibits the highest precision, indicating that our proposed model yields the most rigorous CID predictions. We note that when fine-tuning the prediction threshold, our model can achieve a recall of 75.0%, a precision of 63.1%, and an F1 of 68.5%. The GCN + multi-head Attn model [16] attains the highest recall at 72.7%, which is 7.5% better than ours. However, that approach integrates various predefined rules to construct training instances, which can remove a large number of noisy entity mention pairs. It is worth noting that our model performs significantly better overall, with a 3.3% higher F1 score than the GCN + multi-head Attn model.

5 Discussion and Conclusion

In this study, we propose an approach to tackle the challenge of document-level chemical-induced disease (CID) relation extraction from the biomedical literature. Our proposed model REGREx constructs a unified representation graph for each input document to capture dependency information across multiple sentences. We enrich the graph representation through a state transition process inspired by the controller gates of the LSTM network. Additionally, we incorporate state-of-the-art biomedical contextual word embeddings (BioELMo in our case) to enrich the input of the graph LSTM. Furthermore, we adopt a multi-task learning framework to jointly train relation extraction with named entity recognition.

Experimental results on the BioCreative V CDR benchmark corpus demonstrate the effectiveness and competitiveness of REGREx. In the single-task setting, our model achieves an F1 score of 66.0%, while in the multi-task setting it achieves 66.8%. With an F1 score of 66.8%, our model surpasses all nine compared recent state-of-the-art CID relation extraction models.

Labeling data for CID relation extraction is a time-consuming and labor-intensive task. For future work, we plan to enhance our model with a semi-supervised learning framework, specifically the self-training method. This approach will leverage a large amount of in-domain unlabeled data to further improve the performance of our model.