Document-Level Chemical-Induced Disease Semantic Relation Extraction Using Bidirectional Long Short-Term Memory on Dependency Graph

Identifying chemical-induced disease (CID) semantic relations in the biomedical literature, including both intra- and inter-sentence interactions, has significant implications for various downstream applications. Although various advanced methods have been proposed, they often overlook cross-sentence dependency information, which is crucial for accurately predicting inter-sentence relations. In this study, we propose DEGREx, a novel graph-based neural model that represents a biomedical document as a dependency graph. DEGREx improves long-distance relation extraction by allowing direct information exchange among document graph nodes through dependency connections. The information transition process is based on the idea of controller gates in long short-term memory networks. DEGREx employs a multi-task learning framework to jointly train relation extraction with named entity recognition, improving the performance of the CID extraction task. Experimental results on the benchmark dataset demonstrate that DEGREx outperforms all nine compared recent state-of-the-art models.


Introduction
The extraction of semantic relationships between chemical and disease entities has wide-ranging applications in biomedical research and healthcare, including toxicology studies, drug discovery, and drug safety surveillance [1]. To encourage active involvement from the natural language processing (NLP) community, the BioCreative V community [1] launched the chemical-induced disease (CID) relation extraction challenge.
CID relation extraction is usually formulated as a binary classification problem on recognized biomedical named entities. Given a document of multiple sentences, a list of named entity mentions, and pairs of chemical and disease concept identifiers, the objective is to determine whether a chemical-disease identifier pair has a CID relation. Figure 1 illustrates an example of CID relation extraction.
The CID extraction task, however, presents several challenges. First, the entity relations are annotated at the concept level rather than the mention level in a document, where an entity can be mentioned multiple times in different sentences. As a result, interactions between entities often transcend sentence boundaries [2]; such interactions are referred to as inter-sentence relations. Consequently, recognizing inter-sentence relations is much more intricate than identifying intra-sentence relations. To successfully tackle the task, a model requires a comprehensive understanding of cross-sentence dependencies and discourse information [3,4].
Second, as entities can be mentioned across sentences, another challenge is to identify long-distance CID relations, where a chemical-disease mention pair is separated by hundreds of tokens. Conventional RNN-based models, including long short-term memory (LSTM) [5] and GRU [6], struggle to capture information from very long word sequences [7]. To address this issue, earlier works have proposed alternative approaches that employ either convolutional neural networks [8] or graph-based neural models [3] with advanced contextual word embeddings on document-level dependency graphs.
In this work, we propose a graph LSTM-based neural model called DEGREx for extracting CID relations at the document level. Our model leverages global dependency information across multiple sentences to improve performance. It facilitates information exchange among nodes through dependency connections, thus enhancing long-distance relation extraction. To control the information flow, we extend the LSTM network's gating mechanism to the graph. We also enhance the input representation by incorporating advanced biomedical contextual word embeddings. Finally, we jointly train relation extraction with named entity recognition to enhance the model's robustness. Experimental results demonstrate superior performance compared with state-of-the-art methods for CID relation extraction.

Related Work
Deep learning models have emerged as dominant methods for various NLP tasks. Convolutional neural networks (CNN) [9,10] and long short-term memory (LSTM) [5] have been utilized for CID relation extraction. [11] achieved good performance by employing both LSTM and CNN models on traditional word embeddings [12] and several other kinds of linguistic features. [13] introduced a CNN model that extracted syntactic features from local shortest dependency paths (SDPs) to predict intra-sentence CID relations. [8] enhanced CNN with character-based word embeddings for CID relation extraction. The multi-head self-attention mechanism proposed by Vaswani et al. [14], which enables models to capture important and complex long-range dependencies from diverse contextual perspectives of the input sequence simultaneously, has significantly contributed to the advancement of downstream NLP tasks, including CID extraction. Inspired by this concept, Sahu et al. [2] introduced a bi-affine multi-head self-attention model specifically for CID extraction, yielding performance on par with other well-established CNN- and LSTM-based methods.
Recent advanced approaches in the field have endeavored to overcome the inherent limitation of intra-sentence linguistic features. They often construct a unified graph as the representation of each document, upon which a diverse range of graph-based neural approaches have been proposed. Notably, Sahu et al. [3] introduced a graph convolutional network (GCN) model [15] on a document-level graph whose edges encode both dependencies and co-references. This work has demonstrated great promise in effectively extracting inter-sentence semantic relations. In another study, Wang et al. [16] combined GCN with a multi-head self-attention mechanism on the document-level dependency graph. Additionally, some studies integrate other graphs at different levels, such as GRACR by Liu et al. [17] with an entity-level graph and MHGNN by Wang et al. [18] with three graphs, including a word-level graph, a mention-level graph, and an entity-level graph.
Lu et al. [19] constructed a hybrid graph that merges syntactic and abstract meaning representation graphs, along with hierarchical concentrative attention, effectively capturing and prioritizing important long-distance information for CID relation prediction. Li et al. [20] presented MRN, a mention-based reasoning model that incorporates both local and global reasoning, and a co-predictor module to predict CID relations. Nan et al. [21] introduced LSR, a latent structure refinement-based model that dynamically learns the document graph. LSR performs end-to-end predictions without relying on syntactic trees or heuristics, enhancing the extraction of CID relations. Shi et al. [22] proposed HGNN, a method leveraging heterogeneous graph neural networks. HGNN utilizes temporal convolutional networks and graph transformer networks to capture long-distance dependencies and enhance potential interactions between entities. Zhang et al. [23] presented DHG, a model based on a dual-tier heterogeneous graph. DHG incorporates a structure modeling layer and a relation reasoning layer for multi-hop reasoning and decision-making. Xu et al. [24] introduced SSAN, a structural self-attention network-based model. SSAN integrates unique dependencies between mention pairs using self-attention, enhancing the overall encoding process.
Very recently, an innovative approach for document-level semantic relation extraction has been proposed that transforms the extraction task into a question answering task. Chen et al. [25] introduced RC, a model that follows this approach. After the transformation, RC leverages reading comprehension and prior knowledge to improve the document-level extraction process.

Method
In this section, we describe the details of our proposed model, which consists of six modules as depicted in Fig. 3. First, we describe how the document-level dependency graph is constructed to represent the entire input document. Second, we describe the method for converting each word in the original document into a vector, which serves as the input representation for the document-level dependency graph. The third sub-section focuses on the traditional LSTM network architecture, which is used to gather and enrich contextual information for each word in the paragraph. The fourth sub-section explains how the graph's state transition occurs. The fifth sub-section details how the model makes entity-level predictions for CID relations. The following sub-section outlines the named entity recognition task. Finally, we provide the details of how the model is trained.

Document-Level Dependency Graph
The document-level dependency graph is the core component of our proposed method. To construct this unified graph, we generate a dependency tree for every sentence in the input document. Within the dependency tree, each node is connected to its parent or descendants through syntactic dependencies (e.g., nsubj, case, and det). To create the cross-sentence dependency graph for the document, we connect the roots of the dependency trees of every two consecutive sentences. This new edge type is referred to as next-sent. Additionally, we enrich our document graph by introducing an additional connection called next-node, which links two adjacent nodes together. Finally, we allow each node to connect to itself by a special edge that we denote as self-edge. Figure 2 depicts an example of our document-level dependency graph.
Formally, let us consider our document dependency graph G = (V, E), where V and E represent the sets of all nodes and edges, respectively. V denotes the set of all word tokens in the document, whereas E encompasses a large number of edges, each belonging to one of the four types below:
• Syntactic Dependency Edge: Represents the syntactic relation between two nodes in the dependency tree.
• Next-Sent Edge: Connects two consecutive sentences to capture cross-sentence dependencies.
• Next-Node Edge: Connects adjacent nodes within the same sentence to capture local dependencies.
• Self-Edge: Allows nodes to link to themselves.
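As a rough illustration of the construction above, the following sketch assembles the four edge types into a single edge list. The input format (per-sentence head indices and dependency labels), the function name `build_document_graph`, and the head-to-dependent edge direction are assumptions made for this example and are not specified in the text.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, str]  # (source node, target node, edge type)


def build_document_graph(sent_heads: List[List[int]],
                         sent_labels: List[List[str]]) -> List[Edge]:
    """Assemble a document-level edge list from per-sentence dependency trees.

    sent_heads[s][k]  : in-sentence index of the head of token k (-1 for the root).
    sent_labels[s][k] : dependency label of the arc pointing to token k.
    Both input formats are hypothetical; parser output can be mapped to them.
    """
    edges: List[Edge] = []
    offset = 0                       # document-level index of the sentence's first token
    prev_root: Optional[int] = None  # root node of the previous sentence
    for heads, labels in zip(sent_heads, sent_labels):
        root: Optional[int] = None
        for k, head in enumerate(heads):
            node = offset + k
            edges.append((node, node, "self-edge"))
            if head == -1:
                root = node
            else:
                # Assumed direction: head -> dependent.
                edges.append((offset + head, node, labels[k]))
            if k + 1 < len(heads):
                # Adjacent tokens within the same sentence.
                edges.append((node, node + 1, "next-node"))
        if prev_root is not None and root is not None:
            # Link the roots of two consecutive sentences.
            edges.append((prev_root, root, "next-sent"))
        prev_root = root
        offset += len(heads)
    return edges


# Toy document with two sentences of three and two tokens.
heads = [[1, -1, 1], [1, -1]]
labels = [["nsubj", "root", "obj"], ["nsubj", "root"]]
graph = build_document_graph(heads, labels)
print(len(graph))  # 12 edges: 5 self, 3 syntactic, 3 next-node, 1 next-sent
```

Storing the graph as labeled (source, target, type) triples matches the (i, j, l) edge tuples used later when edge representations are computed.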

Input Representation
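In this section, we present how we construct the input word vectors for our document-level graph LSTM. Let $x_i \in \mathbb{R}^{d}$ denote the embedding representation of the $i$th token in the input sequence $w_1, w_2, \ldots, w_n$. We build $x_i$ by combining four types of embeddings: the contextual word embedding $e_{w_i} \in \mathbb{R}^{d_1}$, the character embedding $c_{w_i} \in \mathbb{R}^{d_2}$, the POS embedding $p_{w_i} \in \mathbb{R}^{d_3}$, and the distance embedding $d_{w_i}$:
$$x_i = e_{w_i} \circ c_{w_i} \circ p_{w_i} \circ d_{w_i}$$
Here, $\circ$ denotes the concatenation operation.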

Contextual Word Embedding:
To generate the contextual representation for each token in the input document, we utilize a biomedical version of ELMo [26] that has been pre-trained on 10 million PubMed abstracts, comprising a total of 2.46B tokens. The word vector $e_{w_i}$ is a dense representation in a $d_1$-dimensional space. In recent years, contextual word embeddings, such as ELMo [27], Flair [28,29], BERT [30], and auto-regressive language modeling [31], have exhibited significant performance improvements for various natural language processing (NLP) tasks, including text classification [32], question answering [30], and named entity recognition [28]. Integrating these powerful bio-context-sensitive word embeddings with deep neural networks has the potential to enhance the performance of the CID extraction task.
Fig. 2 An example of our dependency graph. For simplicity, we only consider two consecutive sentences and omit all next-node edges and self-edges
Fig. 3 An illustration of the convolutional neural network (CNN)-based character embeddings [33]
Character Embedding: Previous studies have demonstrated that character-based word embeddings enable models to capture unknown words and word morphology features [33,34]. In the biomedical domain, we often encounter complex terminologies, such as chemical, protein, or gene names, that exhibit rich morphological structures. Following the approach proposed by [8], we use a simple CNN layer with $d_2$ filters applied to a sequence of character embeddings, each of which has a dimension of $d_5$. To get the character-based representation $c_{w_i}$, we apply a max-pooling layer after this CNN layer to capture the most salient features. Figure 3 depicts our CNN-based character embeddings.
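The convolution followed by max-pooling can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function name `char_cnn_embedding`, the zero-padding of short words, and the absence of a non-linearity are choices made for this example and are not taken from the model description.

```python
from typing import List

Vector = List[float]


def char_cnn_embedding(char_embs: List[Vector],
                       filters: List[List[Vector]],
                       biases: Vector) -> Vector:
    """Convolve each filter over the character sequence, then max-pool over time.

    char_embs : one embedding vector per character of the (non-empty) word.
    filters   : one filter per output feature; each filter holds `window`
                weight vectors of the same size as a character embedding.
    biases    : one bias per filter.
    """
    window = len(filters[0])
    dim = len(char_embs[0])
    # Assumption: zero-pad short words so that at least one window fits.
    padded = char_embs + [[0.0] * dim] * max(0, window - len(char_embs))
    features = []
    for filt, bias in zip(filters, biases):
        responses = []
        for start in range(len(padded) - window + 1):
            total = bias
            for k in range(window):
                total += sum(w * x for w, x in zip(filt[k], padded[start + k]))
            responses.append(total)
        # Max-pooling keeps the strongest response of this filter.
        features.append(max(responses))
    return features


# Toy word with three characters, 2-dimensional character embeddings,
# and two filters of window size 2.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
print(char_cnn_embedding(chars, filters, [0.0, 0.5]))  # [2.0, 2.5]
```

The output has one entry per filter, so a layer with $d_2$ filters yields a $d_2$-dimensional vector regardless of the word length.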
POS Embedding: In addition to the word embedding and the character embedding, we also embed the Part-of-Speech (POS) information into the input representation. The POS embedding $p_{w_i}$ is randomly initialized as a $d_3$-dimensional vector.
Distance Embedding: We enrich the input representation by incorporating absolute distances (in terms of tokens) from the current token $w_i$ to the two target entities. The distance embedding $d_{w_i}$ consists of two sub-vectors, $d^{chem}_{w_i}, d^{dis}_{w_i} \in \mathbb{R}^{d_4}$. These sub-vectors encode the distances from $w_i$ to the Chemical and Disease entities, respectively. Both of them are randomly initialized. Formally, we have $d_{w_i} = d^{chem}_{w_i} \circ d^{dis}_{w_i}$, where $\circ$ denotes concatenation. This incorporation of distance information allows us to capture the positional relationships between the current token and the target entities in our input representation.

LSTM Network
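After the input document passes through the input representation module, we utilize the long short-term memory network (LSTM) [5] to effectively leverage the context information of each token embedding $x_t$. The LSTM network consists of several controller gates to overcome the problem of vanishing gradients. At time step $t$, it computes the current hidden state $h_t$ and cell state $c_t$ based on the input token embedding $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$. The equations governing this computation are as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W_x$, $U_x$, and $b_x$, $x \in \{i, f, o, c\}$, are the model parameters.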
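For readers who prefer code, a single step of a standard LSTM cell can be sketched as follows. This is the textbook formulation written in plain Python; the parameter layout (a dictionary keyed by gate name holding a `(W, U, b)` triple) and the names `affine` and `lstm_step` are assumptions for this example, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b row by row."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bias
            for w_row, u_row, bias in zip(W, U, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix, Vector]]
              ) -> Tuple[Vector, Vector]:
    """One LSTM step; params maps each of 'i', 'f', 'o', 'c' to (W, U, b)."""
    pre = {g: affine(W, x_t, U, h_prev, b) for g, (W, U, b) in params.items()}
    i_t = [sigmoid(z) for z in pre["i"]]        # input gate
    f_t = [sigmoid(z) for z in pre["f"]]        # forget gate
    o_t = [sigmoid(z) for z in pre["o"]]        # output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]  # candidate cell state
    c_t = [f * c + i * u for f, c, i, u in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with all-zero parameters: every gate equals 0.5 and the candidate
# is 0, so the new cell state is half of the previous one.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
toy_params = {g: zero for g in ("i", "f", "o", "c")}
h_t, c_t = lstm_step([1.0, -1.0], [0.3], [2.0], toy_params)
print(c_t)  # [1.0]
```

The forget gate scales the old cell state while the input gate scales the new candidate, which is what lets gradients pass through many time steps without vanishing as quickly as in a plain RNN.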

Node-Edge Representation
Let us consider each edge in our dependency graph as a tuple $(i, j, l)$. We compute the representation for each edge $(i, j, l)$ as follows:
$$s^{l}_{i,j} = W_{node\_edge}\,(h_i \circ e_l) + b_{node\_edge}$$
Here, $W_{node\_edge}$ and $b_{node\_edge}$ are model weights and bias, respectively, and $\circ$ denotes concatenation. $e_l$ represents the embedding for edge type label $l$ and $h_i$ is the LSTM's final hidden state for the token $i$. The edge type vector $e_l$ is randomly initialized and updated in the training process.
An individual node in the dependency graph gathers information from its parents or descendants. To create new input vectors for each node $v_j$, we calculate two terms: the sum over its incoming edges $E_{in}(j)$ and the sum over its outgoing edges $E_{out}(j)$:
$$s^{in}_j = \sum_{(i,j,l) \in E_{in}(j)} s^{l}_{i,j}$$
$$s^{out}_j = \sum_{(j,k,l) \in E_{out}(j)} s^{l}_{k,j} \qquad (5)$$
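The aggregation of edge representations into per-node input vectors can be sketched as below. Several details are assumptions made for illustration: the edge representation is taken to be a linear map over the concatenation of a node's LSTM hidden state and the edge-label embedding, and the outgoing sum is built from the hidden state of the neighbor at the other end of the edge. The names `edge_repr` and `node_inputs` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, str]  # (i, j, l): source, target, edge-type label


def edge_repr(h_node: Vector, e_label: Vector, W: Matrix, b: Vector) -> Vector:
    """Linear map over the concatenation of a hidden state and a label embedding."""
    z = h_node + e_label  # list concatenation
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def node_inputs(edges: List[Edge], h: List[Vector],
                label_emb: Dict[str, Vector], W: Matrix, b: Vector
                ) -> Tuple[List[Vector], List[Vector]]:
    """Sum edge representations over incoming and outgoing edges of every node."""
    dim = len(b)
    s_in = [[0.0] * dim for _ in h]
    s_out = [[0.0] * dim for _ in h]
    for i, j, label in edges:
        # Edge (i, j, l) is incoming for node j: use the source state h[i].
        for d, v in enumerate(edge_repr(h[i], label_emb[label], W, b)):
            s_in[j][d] += v
        # The same edge is outgoing for node i: use the neighbor state h[j]
        # (an assumption about the index order in the outgoing sum).
        for d, v in enumerate(edge_repr(h[j], label_emb[label], W, b)):
            s_out[i][d] += v
    return s_in, s_out


# Toy graph: two nodes, one labeled edge 0 -> 1, identity weights.
h = [[1.0], [2.0]]
label_emb = {"dep": [10.0]}
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
s_in, s_out = node_inputs([(0, 1, "dep")], h, label_emb, W, b)
print(s_in)   # [[0.0, 0.0], [1.0, 10.0]]
print(s_out)  # [[2.0, 10.0], [0.0, 0.0]]
```

Because these sums depend only on the LSTM outputs and the edge labels, they can be computed once per document and reused at every graph transition step.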

State Transition
For convenience, let $r_j$ denote the state of node $v_j$ in our dependency graph $G = (V, E)$. Each $r_j$ consists of two elements: the node hidden state $\hat{h}_j$ and the node cell state $\hat{c}_j$. As a result, we have $r_j = (\hat{h}_j, \hat{c}_j), \forall v_j \in V$. Additionally, we denote the state of our graph as $g$, which is given as follows:
$$g = \{r_j\}|_{v_j \in V}$$
Inspired by the idea proposed in Ref. [35], we adopt a recurrent approach to enhance the document-level state $g$. This approach generates a sequence of graph states $g_0, g_1, \ldots, g_T$, where $g_t = \{r^t_j\}|_{v_j \in V}$. The initial graph state $g_0$ contains a set of initial node states $r^0_j = (\hat{h}^0_j, \hat{c}^0_j), \forall v_j \in V$, which are zero vectors. The number of transition steps, denoted as $T$, can be determined through cross-validation.
During the transition from $g_{t-1}$ to $g_t$, we perform an information-exchange process among the nodes in the dependency graph. This process allows information to flow into a node from the neighbor nodes that are directly connected to it. To avoid the problem of vanishing and exploding gradients, we incorporate several kinds of controller gates inspired by the LSTM framework [5]. Figure 4 illustrates the state transition process.
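Formally, to calculate the state $r^t_j = (\hat{h}^t_j, \hat{c}^t_j)$ for each node $v_j$ at time step $t$, we compute two additional vectors $\hat{h}^{in}_j$ and $\hat{h}^{out}_j$. These vectors are obtained by summing the hidden states of its incoming and outgoing nodes, respectively, from the previous time step $t-1$:
$$\hat{h}^{in}_j = \sum_{(i,j,l) \in E_{in}(j)} \hat{h}^{t-1}_i$$
$$\hat{h}^{out}_j = \sum_{(j,k,l) \in E_{out}(j)} \hat{h}^{t-1}_k$$
The node hidden state $\hat{h}^t_j$ and the node cell state $\hat{c}^t_j$ are calculated using the node-edge representations $s^{in}_j$, $s^{out}_j$, as well as the incoming and outgoing hidden states $\hat{h}^{in}_j$, $\hat{h}^{out}_j$:
$$i^t_j = \sigma(W^{in}_i s^{in}_j + W^{out}_i s^{out}_j + U^{in}_i \hat{h}^{in}_j + U^{out}_i \hat{h}^{out}_j + b_i)$$
$$o^t_j = \sigma(W^{in}_o s^{in}_j + W^{out}_o s^{out}_j + U^{in}_o \hat{h}^{in}_j + U^{out}_o \hat{h}^{out}_j + b_o)$$
$$f^t_j = \sigma(W^{in}_f s^{in}_j + W^{out}_f s^{out}_j + U^{in}_f \hat{h}^{in}_j + U^{out}_f \hat{h}^{out}_j + b_f)$$
$$u^t_j = \tanh(W^{in}_u s^{in}_j + W^{out}_u s^{out}_j + U^{in}_u \hat{h}^{in}_j + U^{out}_u \hat{h}^{out}_j + b_u)$$
$$\hat{c}^t_j = f^t_j \odot \hat{c}^{t-1}_j + i^t_j \odot u^t_j$$
$$\hat{h}^t_j = o^t_j \odot \tanh(\hat{c}^t_j)$$
where $i^t_j$, $o^t_j$, $f^t_j$, $u^t_j$ are the input, output, forget, and update gates, respectively, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. $W^{in}_x$, $W^{out}_x$, $U^{in}_x$, $U^{out}_x$, and $b_x$ ($x \in \{i, o, f, u\}$) are the model parameters.
Fig. 4 An illustration of the state transition process [35]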
At the final transition step $T$, our model generates the graph state $g_T$, which contains a set of rich features $r^T_j = (\hat{h}^T_j, \hat{c}^T_j), \forall v_j \in V$. We utilize the node hidden state $\hat{h}^T_j$ to make predictions at the entity level.
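The recurrent graph update can be sketched as follows, in the spirit of a graph-state LSTM. This is an illustrative sketch rather than the authors' code: the gate equations follow the usual LSTM pattern with separate weights for the incoming and outgoing terms, and the parameter layout (a dictionary per gate with keys `W_in`, `W_out`, `U_in`, `U_out`, `b`) and the function names are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, str]
GATES = ("i", "o", "f", "u")  # input, output, forget, update


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def pre_activation(p: Dict[str, list], s_in: Vector, s_out: Vector,
                   h_in: Vector, h_out: Vector) -> Vector:
    """W_in s_in + W_out s_out + U_in h_in + U_out h_out + b for one gate."""
    parts = (matvec(p["W_in"], s_in), matvec(p["W_out"], s_out),
             matvec(p["U_in"], h_in), matvec(p["U_out"], h_out), p["b"])
    return [sum(vals) for vals in zip(*parts)]


def transition_step(edges, s_in, s_out, h_prev, c_prev, params):
    """Compute the node states at step t from the states at step t - 1."""
    n, dim = len(h_prev), len(h_prev[0])
    h_in = [[0.0] * dim for _ in range(n)]
    h_out = [[0.0] * dim for _ in range(n)]
    for i, j, _label in edges:
        for d in range(dim):
            h_in[j][d] += h_prev[i][d]   # node j receives from its source i
            h_out[i][d] += h_prev[j][d]  # node i receives from its target j
    h_new, c_new = [], []
    for j in range(n):
        pre = {g: pre_activation(params[g], s_in[j], s_out[j], h_in[j], h_out[j])
               for g in GATES}
        i_g = [sigmoid(z) for z in pre["i"]]
        o_g = [sigmoid(z) for z in pre["o"]]
        f_g = [sigmoid(z) for z in pre["f"]]
        u_g = [math.tanh(z) for z in pre["u"]]
        c_j = [f * c + i * u for f, c, i, u in zip(f_g, c_prev[j], i_g, u_g)]
        c_new.append(c_j)
        h_new.append([o * math.tanh(c) for o, c in zip(o_g, c_j)])
    return h_new, c_new


def run_transitions(edges, s_in, s_out, params, T):
    """Start from zero node states and apply T transition steps."""
    n, dim = len(s_in), len(params["i"]["b"])
    h = [[0.0] * dim for _ in range(n)]
    c = [[0.0] * dim for _ in range(n)]
    for _ in range(T):
        h, c = transition_step(edges, s_in, s_out, h, c, params)
    return h, c
```

After $T$ steps, information can travel along paths of up to $T$ edges, so a larger $T$ lets distant mentions influence each other at the cost of additional computation.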

Entity-Level Prediction
Since the CID relation was annotated at the entity level instead of the mention level, we aggregate information from all mention pairs in the document to make the final entity-level prediction.
Following the state transition process, we obtain a final hidden vector for each mention of chemical and disease entities. In the case of mentions spanning multiple nodes, we compute the sum of their node hidden vectors as their representations. Let us denote $c = \{c_1, c_2, \ldots, c_m\}$ and $d = \{d_1, d_2, \ldots, d_n\}$ as the sets of representations for chemical and disease entity mentions, respectively. Here, $m$ and $n$ are the numbers of mentions of each entity type. We apply a linear transformation with the tanh activation function to reduce the dimension of each chemical and disease vector.
The final representations $c^{final}_i$ and $d^{final}_j$ for the $i$th chemical mention and the $j$th disease mention, respectively, are calculated using the following equations:
$$c^{final}_i = \tanh(W_c c_i + b_c)$$
$$d^{final}_j = \tanh(W_d d_j + b_d)$$
where $W_c$ and $W_d$ are the model weights, and $b_c$ and $b_d$ are the corresponding bias vectors for chemical and disease entities, respectively.
To calculate the prediction score for each entity mention pair, we utilize their final vectors and the relative distance between these mentions. We compute a two-dimensional vector that represents whether or not there is a CID relation between the two target entities.
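Formally, the score $a_{ij}$ is computed as follows:
$$a_{ij} = W_{score}\,(c^{final}_i \circ d^{final}_j \circ R_{p_{c_i} - p_{d_j}}) + b_{score}$$
In the equation above, $W_{score}$ and $b_{score}$ are the model parameters, $\circ$ denotes concatenation, and $R_{p_{c_i} - p_{d_j}}$ represents the embedding of the relative distance between the two entity mentions. This embedding is randomly initialized and updated during training.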
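A mention-pair scorer along these lines might look like the sketch below. It assumes that the two projected mention vectors and the relative-distance embedding are concatenated before a final linear layer; that combination, as well as the names `mention_repr`, `linear`, and `mention_pair_score`, are assumptions made for this example. How mention-pair scores are aggregated into the entity-level score is not shown.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mention_repr(node_vectors: List[Vector]) -> Vector:
    """Sum the hidden vectors of all nodes spanned by one mention."""
    return [sum(values) for values in zip(*node_vectors)]


def linear(W: Matrix, b: Vector, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def mention_pair_score(c_vec: Vector, d_vec: Vector, rel_dist_emb: Vector,
                       W_c: Matrix, b_c: Vector, W_d: Matrix, b_d: Vector,
                       W_score: Matrix, b_score: Vector) -> Vector:
    """Project both mentions with tanh, then score the concatenated features."""
    c_final = [math.tanh(z) for z in linear(W_c, b_c, c_vec)]
    d_final = [math.tanh(z) for z in linear(W_d, b_d, d_vec)]
    # Assumed combination: concatenate both mentions and the distance embedding.
    return linear(W_score, b_score, c_final + d_final + rel_dist_emb)


# Toy example: a two-token chemical mention and a one-token disease mention.
chemical = mention_repr([[0.5], [-0.5]])  # -> [0.0]
disease = mention_repr([[0.0]])           # -> [0.0]
score = mention_pair_score(
    chemical, disease, [1.0],
    W_c=[[1.0]], b_c=[0.0], W_d=[[1.0]], b_d=[0.0],
    W_score=[[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]], b_score=[0.0, 0.0])
print(score)  # [1.0, -1.0]
```

The two output entries correspond to the two-dimensional score vector described above, one entry per relation label.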

Named Entity Recognition
Previous studies have demonstrated performance improvements by incorporating named entity recognition as an auxiliary task for relation extraction [2]. In this work, we also investigate the effectiveness of joint training of relation extraction and named entity recognition (NER) for enhancing the performance of CID extraction. We predict entity labels for each token by feeding the LSTM network's output $h_t$ into a linear classifier:
$$l_t = W_{ner} h_t + b_{ner}$$
where $W_{ner}$, $b_{ner}$ are model parameters. Furthermore, we use the standard IOB format to encode the entity boundaries.
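To make the IOB encoding concrete, the following sketch converts mention spans into one tag per token. The span format (start index, exclusive end index, entity type), the tag names such as `B-Chemical`, and the example sentence are illustrative assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def iob_tags(n_tokens: int, mentions: List[Span]) -> List[str]:
    """Encode non-overlapping mention spans as IOB tags, one tag per token."""
    tags = ["O"] * n_tokens              # O: outside any entity
    for start, end, entity_type in mentions:
        tags[start] = "B-" + entity_type  # B: first token of a mention
        for k in range(start + 1, end):
            tags[k] = "I-" + entity_type  # I: continuation of the mention
    return tags


tokens = ["Lithium", "carbonate", "induced", "renal", "failure"]
mentions = [(0, 2, "Chemical"), (3, 5, "Disease")]
for token, tag in zip(tokens, iob_tags(len(tokens), mentions)):
    print(token, tag)
```

With two entity types, this scheme gives five possible labels per token, which is the label set the linear classifier would score.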

Training
We employ softmax functions to compute a probability distribution for both relation extraction and named entity recognition tasks.
For relation extraction, we apply the softmax function to the entity-level prediction score, denoted $a_{c,d}$, to obtain a probability distribution over the set of relation labels:
$$P(r \mid c, d) = \mathrm{softmax}(a_{c,d})$$
To optimize the model, we minimize the negative log-likelihood of the ground-truth relation label given the input dependency graph $G$ and our model parameters $\theta_{re}$:
$$l_{re} = -\log P(r^{*}_{c,d} \mid G; \theta_{re})$$
Here, $r^{*}_{c,d}$ is the ground-truth relation between the chemical entity $c$ and the disease entity $d$.
For named entity recognition, we utilize the softmax function to compute a probability distribution over the set of entity labels, taking the entity label score $l_t$ of token $w_t$ as input:
$$P(y_t \mid w_1, w_2, \ldots, w_n) = \mathrm{softmax}(l_t)$$
To train the named entity recognition model, we minimize the negative log-likelihood of the ground-truth entity labels given the input sequence $w_1, w_2, \ldots, w_n$ and our model parameters $\theta_{ner}$:
$$l_{ner} = -\sum_{t=1}^{n} \log P(y^{*}_t \mid w_1, w_2, \ldots, w_n; \theta_{ner})$$
Here, $y^{*}_t$ denotes the ground-truth entity label for token $w_t$.
In the multi-task setting, we jointly train the named entity recognition and relation extraction tasks, which share all embeddings and LSTM network parameters. The overall loss is computed as the weighted sum of the relation extraction loss ($l_{re}$) and the named entity recognition loss ($l_{ner}$):
$$l = \lambda_1 l_{re} + \lambda_2 l_{ner}$$
Here, $\lambda_1$ and $\lambda_2$ are coefficients that determine the importance of each loss; they are selected by cross-validation.
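The joint objective can be sketched as a weighted sum of two negative log-likelihood terms. The sketch below handles a single chemical-disease pair and a single token sequence; the function names and the assignment of the first coefficient to the relation loss and the second to the NER loss are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def log_softmax(scores: Vector) -> Vector:
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll(scores: Vector, gold: int) -> float:
    """Negative log-likelihood of the gold label under softmax(scores)."""
    return -log_softmax(scores)[gold]


def joint_loss(re_scores: Vector, re_gold: int,
               ner_scores: List[Vector], ner_gold: List[int],
               lambda_1: float = 1.0, lambda_2: float = 1.0) -> float:
    """Weighted sum of the relation loss and the summed per-token NER loss."""
    l_re = nll(re_scores, re_gold)
    l_ner = sum(nll(s, g) for s, g in zip(ner_scores, ner_gold))
    return lambda_1 * l_re + lambda_2 * l_ner


# Toy example: uniform scores over 2 relation labels and, for one token,
# over 3 entity labels, so the loss equals log(2) + log(3).
loss = joint_loss([0.0, 0.0], 0, [[0.0, 0.0, 0.0]], [1])
print(round(loss, 4))  # 1.7918
```

Because both terms are computed from shared embeddings and LSTM parameters, gradients from the NER term also shape the token representations used for relation extraction.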

Dataset
We use the BioCreative V CDR (chemical-induced disease relations) corpus [1] for training, validating, and evaluating our model. This corpus consists of 1500 PubMed abstracts, divided equally into training, development, and test sets of 500 abstracts each. Table 1 provides an overview of the CDR corpus statistics.
In our study, we utilize the gold entity annotations provided in the BioCreative V CDR corpus. The model is trained on the training set, and its hyper-parameters are tuned on the development set. Subsequently, we train the model using both the training and development sets and then conduct a final evaluation on the test set. To assess the performance of our model, we employ the standard F1-score (on the test set) as the evaluation metric.

Experimental Settings
In our experiments, we use a complete biomedical text processing pipeline, namely ScispaCy [36], for word tokenization, dependency parsing, and coarse-grained POS tagging. The dimensions of the POS embeddings and the edge embeddings are both set to 10. We utilize the character embeddings and BioELMo embeddings with dimensions of 30 and 1024, respectively.
For the LSTM network, we set the hidden state dimension to 150. The node hidden states and node cell states are both 150-dimensional vectors. The dimension of the final representation for each entity is set to 100. We encode the relative distance between two mentions of two entities as a 50-dimensional vector. The distance embedding dimension is set to 100. Furthermore, we set the number of graph steps (denoted as $T$) to 6.
During model training, we employ the AdamW optimizer [37] with a learning rate of 7e−4 and a weight decay of 0.01. The minibatch size is set to 8. The number of epochs is set to 3. The dropout rate is set to 0.2. The two loss coefficients $\lambda_1$ and $\lambda_2$ are both set to 1 after performing cross-validation.

Effect of Input Representation
In this experiment, we investigate the impact of incorporating additional input features into the ELMo word representation and assess their effectiveness. We observe that the inclusion of the POS embeddings yields an improvement of 0.2% in the F1 score, reaching a score of 65.0%. This enhancement demonstrates the meaningful contribution of POS information to CID relation extraction.
Furthermore, incorporating the character embeddings increases our model's F1 score from 65.0 to 65.7%. On the other hand, when we substitute the contextual word embeddings BioELMo with the static word embeddings BioWord2Vec [38], the model's F1 score declines significantly, from 65.7% down to 55.7%. This indicates that contextual word embeddings can generate more informative word representations, which are more beneficial for the CID extraction task. Table 2 presents a summary of our proposed model's performance using different input representations.

Effect of State Transition Process
To evaluate the usefulness of the state transition process, we conduct an ablation experiment similar to the one for the input representation. In this experiment, we utilize the Graph LSTM with all types of embeddings, including the ELMo embedding, character embedding, POS embedding, and distance embedding. We, however, remove the state transition module from the Graph LSTM model, which means that the LSTM's output is directly used for the entity-level prediction. The results of this experiment are presented in Table 3.
Table 3 demonstrates the crucial role of the state transition process in the performance of our proposed model, as removing it causes a significant decrease of 2.1% in the F1 score. We note that the inclusion of the graph state transition process does not change the size of DEGREx, which has 100 million parameters.
We conducted thorough experiments to investigate whether DEGREx pays a significantly higher computation cost for the superior performance obtained by adding the fully connected Graph LSTM. The results show that the inclusion of the graph state transition process causes DEGREx to complete training 0.5 min slower, from 14 to 14.5 min. Similarly, the average inference time for a test input document is 36 s slower when adding the graph state transition process, i.e., increasing from 1.16 to 2.23 min.
In addition, we explore the capability of the state transition module in capturing inter-sentence relations. For this purpose, we create a subset of the input document where each sentence does not contain any chemical-disease mention pair. This subset serves as the input for our Graph LSTM model. The performance of our proposed model in predicting inter-sentence CID relations is shown in Table 4.
As demonstrated in Table 4, the removal of the state transition module results in a notable decrease of 1.2% in F1 for predicting inter-sentence CID relations. This underscores the critical importance of the state transition process in predicting such CID relations.

Effect of Multi-Task Learning
To investigate the efficacy of multi-task learning, we conduct two experiments as follows. In the first experiment, we utilize the Graph LSTM solely for learning a single task, which is CID relation extraction. In the second experiment, we employ multi-task learning to simultaneously train our model on both the relation extraction and the named entity recognition tasks by optimizing the joint loss function. Table 5 shows the performance of our model in these two experiments. The incorporation of multi-task learning has proven to be beneficial for our model, leading to a notable improvement in performance. In particular, joint training with the named entity recognition task results in an increase in F1 from 66.0 to 66.8%. This improvement (0.8% in F1) highlights the effectiveness of jointly integrating named entity recognition in enhancing entity representations, thereby improving the prediction of CID relations.
Compared with eight models introduced between 2020 and 2021, our model achieves higher F1 scores, with margins ranging from 0.9 to 2.4%. We note that GCNN, introduced in 2019 [3], which utilizes a labeled edge graph convolutional neural network, achieves the lowest F1 score of 58.6% among the nine compared models. Interestingly, our model still outperforms the recent RC model proposed in 2022, showing a 0.7% improvement in F1 score. Out of the nine compared models, RC is the only one that achieves a well-balanced precision and recall for document-level CID relation extraction, which is similar to our model. Although our model has the lowest recall, it exhibits the highest precision, indicating that our proposed model could yield the most rigorous CID predictions. We note that when fine-tuning the prediction threshold, our model could achieve a recall of 75.0%, a precision of 63.1%, and an F1 of 68.5%. The GCN + multi-head Attn model [16] achieves the highest recall at 72.7%, which is 7.5% better than ours. However, their approach integrates various predefined rules to construct training instances, which can remove a large number of noisy entity mention pairs. It is worth noting that our model performs significantly better overall, with a 3.3% higher F1 score compared to the GCN + multi-head Attn model.
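As a quick consistency check of the threshold-tuned figures quoted above, F1 is the harmonic mean of precision and recall. The helper name `f1_score` is chosen for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 63.1% and recall 75.0% after tuning the prediction threshold.
print(round(f1_score(63.1, 75.0), 1))  # 68.5
```

The harmonic mean is dominated by the smaller of the two values, which is why trading some precision for recall through the threshold can raise F1 when the two are unbalanced.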

Discussion and Conclusion
In this study, we propose an approach to tackle the challenge of document-level chemical-induced disease (CID) relation extraction from the biomedical literature. The proposed model, DEGREx, constructs a unified representation graph for each input document to capture dependency information across multiple sentences. We enhance the graph representation through a state transition process, which is inspired by the controller gates in the LSTM network. Additionally, we incorporate state-of-the-art biomedical contextual word embeddings (i.e., BioELMo in our case) to enrich the input of the graph LSTM. Furthermore, we adopt a multi-task learning framework for jointly training relation extraction with named entity recognition.
Experimental results on the BioCreative V CDR benchmark corpus demonstrate the effectiveness and competitiveness of DEGREx. In the single-task setting, our model achieves an F1 score of 66.0%, while in the multi-task setting, it achieves an F1 score of 66.8%. With an F1 score of 66.8%, our model is superior to all nine compared recent state-of-the-art CID relation extraction models.
Labeling data for CID relation extraction is a time-consuming and labor-intensive task. For future work, we plan to enhance our model with a semi-supervised learning framework, specifically the self-training method. This approach will leverage a large amount of in-domain unlabeled data to further improve the performance of our model.

Fig. 1 An illustration of the state transition process
and $b_x$, $x \in \{i, f, o, c\}$ are the model parameters. Moreover, we use two separate LSTM networks, known as Forward LSTM and Backward LSTM, to capture the context information in both left-to-right and right-to-left directions simultaneously. For each token embedding $x_t$, we generate a final hidden state $h_t$, which is obtained by concatenating the Forward hidden state h

Table 1
BioCreative V CDR corpus statistics

Table 2 The
ELMo, Char, and POS denote the ELMo embedding, character embedding, and POS embedding, respectively. The best results are highlighted in bold, while the lowest results are indicated with italicized and underlined text

Table 6
Experimental results of our proposed model DEGREx and other related models on the CID extraction task. The best results are highlighted in bold, while the lowest results are indicated with italicized and underlined text. The reported results of our model are the average obtained from ten trials on the BioCreative V CDR test set