Keywords

1 Introduction

Adverse Drug Events (ADEs) are injuries resulting from medical intervention related to a drug [7]. A typical way to detect ADEs is to conduct a clinical trial. However, there are many settings where a drug would be used, and we cannot check all of them during the clinical trial. Besides, some ADEs have long latency, making them hard to be discovered by an ordinary clinical trial [29]. Post-marketing drug safety surveillance, also called pharmacovigilance, is conducted to solve these problems. Pharmacovigilance activities mostly depend on Spontaneous Reporting Systems, which collect usersā€™ voluntary ADE reports [18]. However, the number of people willing to report their experiences through the official systems is negligible. Furthermore, these systems are limited due to biased and incomplete reports.

Compared with reports using Spontaneous Reporting Systems, more people often talk about their adverse reactions on social media platforms. Recent publications collect documents from social media such as Twitter and Reddit to obtain more reliable data and detect ADEs automatically using Nature Language Processing (NLP) techniques. The detection of ADEs can be seen as a text classification task or a sequence-labeling problem, where we need to identify documents including ADEs [8]. The early studies include lexicon-based and rule-based methods [28]. These methods focus on string-matching, which is less effective for social media text and consumes many resources to build rules. Machine learning algorithms are also used to solve this task, such as Support Vector Machine (SVM) [4], Recurrent Neural Network (RNNs) [5] and Convolutional Neural Networks (CNNs)Ā [10]. These approaches can process text with manual feature engineering or enable automated feature learning with deep learning methods, facilitating automated ADE detection from biomedical text or social content. However, the existing approaches and models have two limitations: (1) some works are limited in capturing the rich context information in the text. (2) some do not fully consider the causal relationship and dependency between different entities in a document. Effective text encoding should be considered for the ADE detection task to capture rich semantic and contextualized information. Note that detecting causal relationships does not here refer to causal inference as in the field of machine learning focusing on causalityĀ [24], but rather expressing or indicating the relationship between the cause, e.g. a drug taken, and the respective individualā€™s adverse health outcome as reported in the text sample.

Graphs are commonly used for different data representations because of their strong expressivity. Text data can be represented by heterogeneous graphs, where different words, phrases, and documents are seen as nodes, and their relations are shown using edges. Text graphs and graph neural networks are widely used in many NLP applications for healthcare tasks such as sentiment classification and review rating [20, 35]. Graph Neural Networks (GNNs) [33] can be applied to graph representation learning and capture the causal relationships and dependency of objects, making them more suitable for representing text with adverse drug events. However, no existing studies on ADE detection employ graph representation and graph neural networks. Besides, contextualized representations of text facilitate various NLP applications and boost the performance of NLP systems with minimal architecture engineering. In the medical domain, contextualized embeddings with domain knowledge are also in need. Pretrained contextualized language embeddings have been applied to various medical applications such as medical code assignmentĀ [11] and biomedical knowledge graph constructionĀ [12].

This paper presents a contextualized graph embedding model for ADE detection. We build contextualized language embeddings to capture contextualized information. With a heterogeneous graph built to embody word and document relations from the ADE corpus, we use graph neural networks to learn causal relations between word and document nodes to improve adverse drug reaction detection. This paper deploys different GNN-based models and pre-trained contextualized embeddings. The performance of these models is evaluated and compared with state-of-the-art models on three public benchmarks for ADE detection. Our model outperforms several strong ADE detection models in most cases. We also analyze the experiment results to discuss some potential challenges and explore the potential for improving the ADE detection tasks. The code will be made publicly available on acceptance.

Our contributions include the following folds.

  • We develop a contextualized graph embedding model (CGEM) that introduces text graphs to capture the cause-effect relation for drug adverse event detection.

  • The CGEM model utilizes contextualized embeddings pre-trained in large-scale domain-specific corpora for capturing context information, convolutional GNNs for text graph encoding, and an attention classifier for ADE classification.

  • Experimental results show our approach outperforms recent advanced ADE detection models in three public datasets from the biomedical domain and social media.

2 Related Work

The rapid development of deep learning makes neural network-based approaches predominant in ADE detection. RNN can process sequence information and capture the sequential dependency, making it is suitable for ADE detection from text. Many studies on the ADE detection task employ RNN-based models. Cocos et al. [5] developed a Bidirectional Long Short-Term Memory (BiLSTM) network to label different parts of a sequence for ADE detection. Information from recognition of concepts and relations can benefit each other, enabling this joint modeling technique to obtain more useful information during learning. However, inaccurate recognition in the first step will affect the following steps, known as the error propagation issue. To address this issue, Wei et al. [31] proposed a joint learning model which can recognize entities of ADE, the reason, and their relations simultaneously. In the recognition phase, the joint model employs CRF and BiLSTM. To achieve relation classification, it uses CNN-RNN and SVM.

Some studies also developed models with other neural network architectures, such as capsule networks and the self-attention mechanism. Zhang et al. [38] presented a model called Gated iterative capsule network (GICN), which applies CNN to obtain the complete phrase information and extracts deep semantic information using a capsule network with a gated iteration unit. This unit can remember contextual information by clustering features. However, they did not consider the wights of different parts of a document. With attention mechanisms, more critical parts of a document get higher weights. Ge et al. [9] employed Multi-Head Self-Attention in their model to distinguish the importance of different words. Wunnava et al. [34] developed a dual-attention mechanism with BiLSTM to capture both task-specific and semantic information in the sentence. However, they did not fully consider the causal relationship between entities in a document.

3 Methods

3.1 Overall Architecture

This paper defines ADE detection as a classification task. We develop the contextualized graph embedding model as illustrated in Fig.Ā 1. There are three components of the model. (1) Graph Construction with Contextualized Embeddings. We construct a heterogeneous graph to represent words and documents in the whole dataset, following TextGCNĀ [35], and use pre-trained language models, specifically BERTĀ [6] and its domain-specific variants, to obtain the contextualized text representation. (2) Graph-based Text Encoding. To capture neighborhood information in the heterogeneous graph, the feature matrix obtained from the embedding layer and the adjacency matrix from the constructed graph are fed into graph encoders. The feature embeddings are iteratively updated in the heterogeneous relational networks of words and documents. (3) ADE Classification. We follow the BertGCN model [20] to fuse contextualized embedding and graph networks with a weight coefficient to balance these two branches. Furthermore, we build an attentive classification layer to allow more critical content to contribute more to predictions. FigureĀ 1 shows the overall model architecture. The details of these components are introduced in the following sections.

Fig. 1.
figure 1

The illustration of the model architecture with contextualized graph embeddings for ADE detection

3.2 Graph Construction

Heterogeneous Graph. We first represent text as a graph before feeding it to neural networks. Representing text in a heterogeneous graph can provide different perspectives for text encoding and improve ADE detection. The process of graph construction follows TextGCNĀ [35]. Nodes in the graph represent documents and different words. The number of nodes n equal to the number of documents \(n_{d}\) plus the number of unique words \(n_w\) in the whole dataset, i.e., \(n=n_{d}+n_w\). There are two types of edges, i.e., word-word and document-word edges. We use the term frequency-inverse document frequency (TF-IDF) of one word in the document to represent the weight of a document-word edge, while the weight of a word-word edge is based on positive point-wise mutual information (PMI) of two words. We can represent the weight between the node i and the node j as:

$$\begin{aligned} \textbf{A}_{ij}= \left\{ \begin{array}{lr} \text {PMI}(i,j), &{} {\text {PMI}}>0; \text {i, j: words}\\ \text {TF-IDF}_\textrm{ij}, &{} \text {i: document, j: word}\\ 1, &{} i=j\\ 0, &{} \text {otherwise} \end{array}. \right. \end{aligned}$$
(1)

Contextualized Embeddings. We used three pre-trained contextualized language models to obtain embeddings for documents. They are all BERT-based models but pre-trained with different strategies or corpora collected from different domains. The pre-trained language embeddings include:

  • RoBERTaĀ [21]: a pre-trained model with masked language modeling (MLM) objective on English language. In this paper, we used the base version.

  • BioBERTĀ [17]: a BERT-based model trained with biomedical corpora including PubMed abstracts and PubMed Central full-text articles.

  • ClinicalBERTĀ [2]: another domain-specific BERT-based model which is trained on clinical notes from the MIMIC-III database [13].

Given the dimension of embeddings denoted as d, the final output of contextualized text encoding are denoted as \(\textbf{H}_{doc} \in \mathbb {R}^{{n_d}\times d} \). We then apply a zero matrix as the initialization of word nodes to get the feature matrix input to GNN:

$$\begin{aligned} \mathbf {H^{(0)}}= \left( \begin{array}{cc} \textbf{H}_{doc} \\ \textbf{0} \\ \end{array}\right) \end{aligned}$$
(2)

where \(\mathbf {H^{(0)}} \in \mathbb {R}^{{(n_d+n_w)}\times d}\).

3.3 Graph-Based Text Encoding

This section employs a graph-based model for text encoding and capturing complex heterogeneous relationships. Graph neural networks are powerful models to mine and capture the relations and dependencies of graph data. Specifically, we apply two graph neural networks, i.e., Graph Convolutional Network (GCN) [16] and Graph Attention Network (GAT) [30], which are commonly used in different tasks. Graph convolution encodes the topological structure of the heterogeneous graph, enables label influence propagation, and achieves effective modeling of ADE corpora. In this section, we introduce their principles.

GCN is a category of Convolutional Graph Neural Networks (ConvGNNs) models. It is a spectral-based model which incorporates nodesā€™ feature information from their neighbors. It can be seen as a multilayer neural network limited to undirected graphs where the number of layers is fixed. Each layer has different weights to better process cyclic mutual dependencies. GCN is the approximations and simplifications of Spectral CNN. It approximates spectral graph convolutions using convolutional architecture to get a localized first-order representation.

A graph G consists of nodes set V, and edge sets E. \(\textbf{A}\) is the adjacency matrix obtained from the step of graph construction, and \(\mathbf {\hat{A}}\) is its normalized form. \(\textbf{D}\) is the degree matrix, where \(\textbf{D}_{ij} = \sum _{j}\textbf{A}_{ij}\). In the GCN model, multiple layers are stacked to integrate information about higher-order neighborhoods. In the m-th layer, the feature matrix is updated as:

$$\begin{aligned} \textbf{H}^{(m)}=f(\mathbf {\hat{A}}\textbf{H}^{(m-1)}\textbf{W}^{(m-1)}), ~~~~~~ \mathbf {\hat{A}}=\textbf{D}^{-\frac{1}{2}}\textbf{A}\textbf{D}^{\frac{1}{2}}, \end{aligned}$$
(3)

where \(\textbf{H}^{(m)}\in \mathbb {R}^{n\times d_m}\), and \(\textbf{W}^{(m-1)} \in \mathbb {R}^{d_{m-1}\times d_m}\) is the weight matrix, \(\textbf{H}^{(0)}\) is the output from contextualized language models, and \(f(\cdot )\) is an activation function.

Being similar to GCN, GAT is also a ConvGNNs model. However, it is spatial-based neural networks, where node information is propagated within edges and graph convolutions are finally decided by the spatial relation. It employs the message passing process and attention mechanism to learn relations between nodes. Graph attention layers in GAT assign different attention scores to one nodeā€™s distant neighbors and prioritize the importance of different types of nodes.

3.4 Classification Layers

The GNN-based text encoding produces hidden feature representations \(\textbf{H} \in \mathbb {R}^{n\times d_c}\). We propose to use an attention mechanism (Eq.Ā 4) to put more attention on nodes with more important information related to positive or negative ADE classes, denoted as

$$\begin{aligned} \textbf{s} = {\text {softmax}}(\mathbf {w_a}\textbf{H}^{{\text {T}}}), \end{aligned}$$
(4)

where \(\mathbf {w_a}\) \(\in \mathbb {R}^{d_c}\) and \(\textbf{s}= ({s_1}, {s_2}, \cdots , {s_n})\in \mathbb {R}^{n}\) is the attention weight vector containing attention score of each node. Attention scores from the attentive classification layer are different from the attention layer of GAT. Here, attention scores measure which nodes are more important to the graph, while in the attention layer of GAT, attention scores decide the importance of one node to the other node in the neighborhood. The weight is assigned to feature matrix to obtained attentive hidden representation weighted by attention scores, i.e., \(\mathbf {H_a} = [s_1\times \mathbf {h'_1}, s_2\times \mathbf {h'_2}, ..., s_n\times \mathbf {h'_n}]\).

Then, we apply the softmax classifier over the graph-based encoding and obtain the probability of each class as:

$$\begin{aligned} \textbf{p}_{g} = {\text {softmax}}(\mathbf {W_f}\mathbf {H^{{\text {T}}}_a}\mathbf {v_f}), \end{aligned}$$
(5)

where \(\mathbf {W_f}\) \(\in \mathbb {R}^{n\times d_h}\) and \({\mathbf {v_f}}\) \(\in \mathbb {R}^{n\times 2}\) are the weight matrices. We apply the same calculation as Eq.Ā 5 but with different weight matrices to pretrained contextualized embeddings \(\mathbf {H^{(0)}}\). Finally, we get \(\textbf{p}_{c}\) as the prediction probability from the contextualized embeddings. A weight coefficient \(\lambda \in [0,1)\) is introduced to balance the result from graph-based encoding models and the result from BERT-based contextualized models:

$$\begin{aligned} \textbf{p} = \lambda \textbf{p}_{g}+(1-\lambda )\textbf{p}_{c}. \end{aligned}$$
(6)

This weighted strategy can also be viewed as an ensemble of two classifiers or the interpolation of the prediction probability of two classifiers.

3.5 Model Training

We apply the negative log-likelihood loss function as the training objective. Because data in one of the datasets used in our study is imbalanced and the number of instances of this dataset is not large where the downsampling method is not suitable, we use the weighted negative log-likelihood loss function to solve the data imbalance problem [27]. Assuming that the number of documents containing ADE is \(N_1\) and the number of documents not containing ADE is \(N_2\), the weight \(w_+\) for documents predicted as positive samples is \(\frac{N_2}{N_1+N_2}\) and the weight \(w_-\) for documents predicted as negative samples is \(\frac{N_1}{N_1+N_2}\). The weighted loss function is:

$$\begin{aligned} L=-\frac{1}{N}\sum _{i=1}^N (w_{+} y_i\log (p_i)+w_{-}(1-y_i)\log (1-p_i)), \end{aligned}$$
(7)

where N is the number of documents in one batch and \(y_i\) is the true label of a document. When a document contains ADE, \(y_i\) equals to 1; otherwise, \(y_i\) equals to 0. The Adam optimizerĀ [15] is used for model optimization. To control the learning rate, we use the multiple-step learning rate scheduler. The learning rate scheduler decays the learning rate by the parameter \(\gamma \) when the number of epochs reaches a specific number.

4 Experiment

4.1 Data andĀ Pre-processing

We used three datasets from the biomedical domain and social media to evaluate the performance of baselines and our model. The details of these datasets are shown in TableĀ 1. We perform data pre-processing before building graph representation. Specifically, stop words, punctuation, and numbers are removed. For the data collected from Twitter, we use the tweet-preprocessor Python packageFootnote 1 to remove URLs, emojis, and some reserved words for tweets.

Table 1. A statistical summary of datasets

TwiMed-Twitter and TwiMed-PubFootnote 2. The TwiMed datasetĀ [3] includes two sets collected from different domains, i.e., TwiMed-Twitter and TwiMed-Pub. They consist of documents from Twitter and PubMed, respectively. People with different backgrounds annotate diseases, symptoms, drugs, and their relations in each document. There are three types of relations: Outcome-negative, Outcome-positive, and Reason-to-use. When a document is annotated as outcome-negative, it is marked as ADE (positive). Otherwise, we mark it as non-ADE (negative). The TwiMed-Pub has a small number of documents containing ADEs. The weighted loss function is used to solve the issue of imbalanced classification. Models are evaluated by 10-fold cross-validation.

SMM4H DatasetFootnote 3 [22, 26]. The dataset is from Social Media Mining for Health Applications (#SMM4H) shared tasks. Documents collected from Twitter contain a description of drugs and diseases. The dataset contains 17,385 tweets for training and 915 tweets for testing. In our experiment, since this dataset is large enough, we conduct downsampling to mitigate the problem of imbalance, where we only use 2418 tweets, half of which are negative (non-ADE) and the other half are positive (ADE). The training tweets are split into train and validation sets, with a ratio of 9:1. We use the official validation set to evaluate the model performance for a fair comparison with baseline models developed in the SMM4H shared task, such as [14, 25, 36].

4.2 Baselines, Evaluation andĀ Setup

Precision (P), Recall (R), and F1-score are commonly used to measure different models in a classification task. We report these three metrics in our results and mainly use the F1-score to compare modelsā€™ performance in our experiments. We consider two sets of baseline models for performance comparison: 1) models explicitly designed for ADE detection and 2) pre-trained contextualized models.

Customized models for ADE detection include:

  • CNN-TransferĀ [19] (CNN-T for short): a CNN-based model with transfer learning module. It has two sentence classifiers and a shared feature extractor based on CNN.

  • HTR-MSAĀ [32]: a model with hierarchical tweet representation and multi-head self-attention. This model learns word representations and tweet representations with CNN and Bi-LSTM. The multi-head self-attention mechanism is also applied.

  • ATLĀ [19]: a model based on adversarial transfer learning for the ADE detection, where corpus-shared features are exploited.

  • MSAMĀ [37]: a model with the multihop self-attention mechanism. It captures contextual information using Bi-LSTM and applies an attention mechanism in multiple steps to generate semantic representations of sentences.

  • IANĀ [1]: interactive attention networks, a model to interactively learn attentions in the context and model targets and context separately.

We compare our model with pre-trained language models on the SMM4H dataset as it is a recent dataset not studied by the aforementioned ADE detection baselines. We use the base version of pretrained models in our experiments for a fair comparison, which is the same setting as in the compared baselines.

  • BERTĀ [6]: a language representation models pre-training with unlabeled text. Yaseen et al.Ā [36] proposed a model that combined LSTM with a BERT encoder for ADE detection, denoted as BERT-LSTM in this paper.

  • RoBERTaĀ [21]: a BERT-based model on the English language with slightly different pre-training strategies. Pimpalkhute et al.Ā [25] developed a data augmentation method with RoBERTa text encoder for ADE detection, denoted as RoBERTa-aug in this paper.

  • BERTweetĀ [23]: a domain-specific model for English Tweets with the same architecture as BERT-base. Kayastha et al.Ā [14] built a model with BERTweet and single-layer BiLSTM for ADE detection, denoted as BERTweet-LSTM in this paper.

We use Python 3.7 and PyTorch 1.7.1 to implement the model. The hyper-parameters we tuned in our experiments are presented in TableĀ 2. In our experiment, we set the hyper-parameter of the learning rate scheduler \(\gamma \) and the milestone of epoch number to 0.1 and 30, respectively.

Table 2. Choices of hyper-parameters

4.3 Main Results

We compared our model with baseline models for the ADE detection task to validate the performance of our model. TablesĀ 3 and TableĀ 4 show the results of TwiMed and SMM4H dataset, respectively. Our model achieves the best performance for all datasets compared with other methods in terms of F1-score. The best result of TwiMed-Pub is obtained with ClinicalBERT embeddings and a GAT encoder. As for SMM4H and TwiMed-Twitter, the best combination of building blocks is RoBERTa embeddings and GCN encoder.

Table 3. Results of TwiMed datasets
Table 4. Results of SMM4H dataset

As shown in TableĀ 3, performances of HTR-MSA, ATL, and CNN-Transfer are lower than others. The network structures of these three models are complex, resulting in a large amount of data being required. Thus, it performs worse than other models on small corpora. MSAM achieves the best performance on recall, while our model performs the best on precision and F1-score. Our model can balance precision and recall better. The competitive performance on the three datasets also shows the high generalization ability of our model. In TableĀ 3, the performances of most models on the two datasets are significantly different. It is challenging to detect ADEs from tweets since tweets are informal text and contain much colloquial language. However, our model performs well on the TwiMed-Twitter dataset, showing that it can effectively encode information from the informal text and better capture relationships of entities in a document. From TableĀ 4, we can find that other models are all BERT-based models. In contrast, our model employs GNN architectures, which suggests GNN can significantly improve modelsā€™ performance on this task.

4.4 Analyses andĀ Discussion

We further analyze the contextualized graph embedding model in this section, discuss the choice of different building blocks, and conduct a case study.

Choice of Graph Encoders. Our experiment examines GCN and GAT to study which one is more suitable for the ADE detection task. We record the best result under different graph encoders. For both GCN and GAT, we obtain the best result from RoBERTa for the SMM4H dataset and TwiMed-Twitter. For TwiMed-Pub, the best result is obtained using ClinicalBERT. From TableĀ 5, we can find the results from the two GNNs are similar, showing that they both performed well on this task.

Table 5. Comparison on the choices of graph encoders, i.e., GCN and GAT

Choice of Pretrained Embeddings. We examine three contextualized language models in our experiment. We record the best results with different language models. When using RoBERTa, the best results for the SMM4H dataset, TwiMed-Pub, and TwiMed-Twitter are from GCN, GAT, and GCN, respectively. When using ClinicalBERT, the best results for the SMM4H dataset and TwiMed-Pub are from GAT, and for TwiMed-Twitter, the best result is from GCN. When using BioBERT, the choice of GNNs for best results is the same as using ClinicalBERT.

From TableĀ 6, we can find that, for TwiMed-Pub, there is little difference among the three pre-trained language models. However, for the SMM4H dataset and TwiMed-Twitter, RoBERTa performs better than others. The SMM4H dataset and TwiMed-Twitter dataset contain documents with many non-medical terms, while ClinicalBERT and BioBERT are trained with many medical terms. Therefore, when there are insufficient medical terms in the text, ClinicalBERT and BioBERT are unsuitable. RoBERTa is a better choice for informal text for this task.

Table 6. The effect of contextualized text embeddings obtained pretrained from different domains

Ablation Study on the Attention Classifier. To examine the effect of the attention classifier, we conduct an ablation study in our experiment. We remove the attentive classification layer and check the performance change in F1 scores.

From TableĀ 7, we can find that after removing the attentive classification layer, values of F1-scores get decreased for all three datasets. It suggests that the attentive classification layer can improve the model to prioritize information in the heterogeneous graph. More meaningful content, such as the description of symptoms and drugs, medical terms, and other relevant information related to ADEs, can contribute more to final predictions by employing attention mechanisms in the classification layer.

We also notice that F1 scores increase with the attentive classification layer, while precision scores for the SMM4H and TwiMed-Twitter datasets decrease. The documents of these two datasets are both from Twitter. Tweets are informal texts that do not follow the logical order, and their structures are unclear. They lack medical terms, and some content that seems not to be related to ADEs may also help determine whether a document contains ADEs or not. After applying the attentive classification layer, the model puts more attention to parts directly related to the description of symptoms, resulting in a tendency where a tweet is more easily to be predicted as a positive sample. Therefore, the precision value decreases after employing the attention classification layer. Besides, we can find that the F1 score on the SMM4H dataset decreases to a greater extent without an attentive classification layer. This dataset contains more documents compared to others. It suggests that the attentive classification layer works better for larger datasets. For small corpora, models with simpler architectures also perform well.

Table 7. Comparison between our model and the model without attentive classification layer

Effect of Weight Coefficient \({\boldsymbol{\lambda }}.\) The weight coefficient \(\lambda \)ā€™s value controls the trade-off between the contextualized language models and graph neural networks. When \(\lambda \) equals zero, only BERT-based pre-trained contextualized embeddings are considered. In 2, dashed lines show the values of the F1-score when \(\lambda \) equals to zero. After employing GNNs (\(\lambda =0.1,0.3,0.5,0.7,0.9\)), we can find that the value of the F1-score increases on all three datasets. It demonstrates that convolutional GNNs can improve the performance of our model significantly. Determining whether a symptom description is about the disease itself or adverse reactions resulting from the disease is a challenge in ADE detection. Utilizing GNNs helps solve this issue since GNNs can better capture the cause-effect relation and dependency between different entities of documents.

We can find the trend of the three lines are similar in respective plots of Fig.Ā 2. In terms of F1-score, the best choices of the value of \(\lambda \) for three datasets are 0.5 (SMM4H), 0.9 (TwiMed-Pub), and 0.7 (TwiMed-Twitter). It suggests how to choose the value of \(\lambda \) depending on which datasets we use and other model hyper-parameters. Also, when values of \(\lambda \) are greater than 0.5, the F1 scores are relatively high. Therefore, we can first choose a high value of \(\lambda \) to allow graph embeddings to contribute more.

Fig. 2.
figure 2

The effect of weight coefficient \(\lambda \) on values of metrics

Case Study. We conduct a case study to explore the effect of the attention mechanism in Eq.Ā 4. We choose two documents classified as positive samples in the SMM4H test dataset, where one is classified correctly while the other one does not contain ADE. We record the attention scores of words of these two tweets and utilize a heap map to show the value of different wordsā€™ attention scores in a document, illustrated in Fig.Ā 3. FigureĀ 3a of a correctly classified tweet shows nouns (such as medication, sideaffects and seroquel), verbs (such as jolting), and sentiment words (such as hard and bad) related to drugs and symptoms get high attention scores. It helps the model put more attention on these important words. However, assigning high attention scores to such words does not ensure correct predictions. FigureĀ 3b shows the attention scores of a tweet incorrectly classified as a positive sample. We can find that words related to symptoms, negative sentiment, and drugs are still getting high scores, while the tweet does not talk about ADE directly.

Fig. 3.
figure 3

Case study of the attention scores in two tweets: (a) with ADE; and (b) without ADE

5 Conclusion

The automated detection of adverse drug events from social content or biomedical literature requires the model to encode text information and capture the causal relation efficiently. This paper utilizes contextualized graph embeddings to learn contextual information and causal relations for ADE detection. We equip different convolutional graph neural networks with pre-trained language representation, develop an attention classifier to detect ADEs in documents and study the effects of different building components in our model. By comparing our model with other baseline methods, experiment results show that graph-based embeddings can better capture causal relationships and dependency between different entities in documents, leading to better detection performance.