Key Points

Adverse drug events (ADEs) reported through spontaneous reporting systems (SRS) have raised concerns over the quality of drug safety information collected due to data incompleteness and under-reporting.

To address these issues, we formulated the extraction of comprehensive drug safety information from ADE narratives reported through the Korea Adverse Event Reporting System (KAERS) as natural language processing tasks and developed a human-annotated corpus.

We developed KAERS-BERT, a domain-specific Korean BERT model, and utilized it to extract comprehensive drug safety information from ADE narratives, resulting in an average increase of 3.24% in data completeness in KAERS.

1 Introduction

Post-marketing surveillance is essential for monitoring and assessing adverse drug events (ADEs), i.e., harm resulting from appropriate or inappropriate use of a drug [1]. In many countries, ADEs are voluntarily reported through a spontaneous reporting system (SRS) as individual case safety reports (ICSRs) using a pre-structured format that investigates the ADE(s) experienced by an individual patient [2, 3]. Regulatory agencies use the drug safety information collected through an SRS to identify potential safety concerns and adjust strategies accordingly for efficient pharmacovigilance [3]. In Korea, the Korea Adverse Event Reporting System (KAERS) was established in 2012 by the Korea Institute of Drug Safety and Risk Management (KIDS) to facilitate the reporting and management of ADEs. The number of ADE reports submitted through SRSs is substantial, partly thanks to electronic submission of ICSRs. For example, more than 200,000 cases have been reported through KAERS every year since 2016 [4]. Indeed, a large number of ADE reports is crucial for early detection of a safety issue and for discovering rare ADEs in the post-marketing phase [3].

However, concerns have been raised over the quality of drug safety information collected through SRS, such as data incompleteness and under-reporting [5, 6]. Substandard data completeness has hindered regulatory agencies in appropriately assessing the relationship between a drug and ADEs based on ICSRs uploaded to an SRS. Although KIDS has run an education program to improve the reporting quality of ICSRs, the completeness of several key data elements, including drug indication and patient medical history, was lower than 75% [7]. Moreover, the problem of missing data has become more frequent as more comprehensive and lengthier forms have been introduced [8]. To address this issue, we present in this paper a natural language processing (NLP) model capable of automatically extracting comprehensive drug safety information from ADE narratives to improve data completeness in the KAERS database. Although ADE narratives reported through KAERS have been collected alongside structured drug safety information, they have not been systematically used to evaluate drug safety until now.

Although many NLP models and human-annotated corpora for extracting drug safety information [9,10,11,12,13,14,15] have been developed to improve the quality of ADE reporting, the focus of these models and corpora has primarily been on detecting ADEs and medication entities. Existing human-annotated corpora for detecting ADEs have rarely tried to extract other drug safety information that would be helpful, and sometimes crucial, for assessing causality between a drug and an ADE, including the temporal relationship between drug administration and ADE occurrence and ADEs at the last observation [16]. In this context, there is a need to develop novel NLP techniques capable of extracting and integrating the additional information present in free-texts that describe ADEs to improve the data completeness of drug safety information databases.

Thus, in this study, we developed NLP models that automatically extract comprehensive drug safety information from ADE narratives reported through KAERS, i.e., free-texts detailing one or more ADEs experienced by a patient and his/her clinical setting (Fig. 1). To this end, we constructed a manually annotated corpus and defined NLP tasks including named entity recognition (NER), sentence extraction, relation extraction, label classification, and entity normalization to formulate the extraction of comprehensive drug safety information from ADE narratives. In this study, we provided baseline models for NER, sentence extraction, relation extraction, and label classification. We observed that the completeness of data fields in KAERS was improved consistently by extracting drug safety information using the NER model we developed. Also, we pretrained domain-specific BERT (Bidirectional Encoder Representations from Transformers) models specialized in clinical texts, where code-switching between English and Korean is frequent. Furthermore, we investigated how the performance of extracting drug safety information improves when a training dataset consists of more diverse ADE narratives as an ablation study.

Fig. 1

Illustration of a the development of natural language processing (NLP) models for extracting comprehensive drug safety information from adverse drug event (ADE) narratives reported through the Korea Adverse Event Reporting System (KAERS), and b an example of extracting comprehensive drug safety information from ADE narratives

2 Related Works

Several human-annotated corpora of clinical narratives from diverse sources have been introduced to define NLP tasks related to pharmacovigilance, such as detection of ADEs and extraction of medication information from free-texts. The sources of those clinical narratives included clinical or physicians' notes from electronic health records (EHRs) [9, 17], consumer reviews on medications [18], drug labels [10], social media [11,12,13,14,15], safety reports in the Vaccine Adverse Event Reporting System (VAERS) [19], and serious ADE reports collected during clinical trials [20]. These corpora principally focused on detecting ADEs and medication entities and normalizing detected entities to medical ontology standards. For example, CADEC (CSIRO Adverse Drug Event Corpus), sourced from posts on AskaPatient, an online medical forum dedicated to consumer reviews on medications, was created with two purposes: (1) entity identification for drugs, ADEs, symptoms, and diseases, and (2) entity normalization to the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), AMT (the Australian Medicines Terminology), and MedDRA (the Medical Dictionary for Regulatory Activities) [18]. In 2018, the National NLP Clinical Challenges (n2c2) shared a task and data on the extraction of ADEs and medication information from clinical narratives and tackled the NLP task in three steps: (1) concept extraction, (2) relation classification, and (3) construction of an end-to-end system integrating the two previous steps [9]. Also, medical text classifiers have been developed to extract ADEs from vaccine safety reports in VAERS [19, 21].

However, to the best of our knowledge, no human-annotated corpus of safety reports from an SRS has ever been built except for those from VAERS. Furthermore, existing human-annotated corpora for detecting ADEs have rarely tried to extract other drug safety information that would be helpful, and sometimes crucial, for assessing causality between a drug and an ADE. This information includes, but is not limited to, the temporal relationship between drug administration and ADE occurrence and the response to withdrawal of the drug [16]. For example, the 2018 n2c2 shared task provided an annotated corpus for nine areas including drug, reason (i.e., reason for medication or indication), and ADEs in discharge summaries, but it did not contain the temporal relationship between drug administration and ADE occurrence [9]. To overcome those shortfalls, we annotated word entities and relations between entities according to the data elements defined in the International Conference on Harmonisation (ICH) E2B(R3) guideline, which provides the formats and data requirements for electronic transmission of different types of ICSRs [22].

While numerous NLP models have been developed to extract ADEs from clinical narratives, and some showed acceptable performance in detecting ADEs and related word entities [9, 10, 13, 14, 17, 18], the robustness of the extraction performance is still questionable due to the narrow clinical context of the annotated corpora. For example, while the best systems achieved F1-scores of 0.82–0.86 for NER and relation extraction in the MADE 1.0 challenges, the annotated corpus consisted of longitudinal EHR notes of only 21 randomly selected cancer patients at a single hospital [17]. Likewise, CADEC only included consumer reviews of 12 drugs containing active ingredients such as diclofenac or atorvastatin [18], and the human-annotated narratives from VAERS only consisted of safety reports from patients with Guillain-Barré syndrome [19]. Therefore, the extraction performance of an NLP system developed and evaluated only in limited clinical contexts could be lower in a wider clinical context. Furthermore, the risk of ADE occurrence and its description would differ according to the clinical setting of a patient who experiences an ADE (e.g., age, comorbidity, and existence of an ADE reporting system in the hospital) [23, 24].

Many pretrained transformer models specialized in the biomedical and clinical domains have also been developed. For example, BioBERT and clinical BERT outperformed previous text-mining models on NLP tasks in the clinical domain, where drug, disease, and other medical terms are frequently written in their abbreviated forms [25, 26]. Furthermore, transformer-based language models achieved new state-of-the-art performance on datasets for extracting drug safety information, including the 2018 MADE 1.0 and 2018 n2c2 challenges [27,28,29]. Recently, a Korean medical BERT (KM-BERT) pretrained on medical textbooks and health information news was developed and has shown its applicability to biomedical NLP tasks in Korean [30]. However, no pretrained language model has ever been developed for both Korean and the clinical domain because large-scale Korean clinical narratives are scarce. Based on this understanding, we pretrained KAERS-BERT, initialized from the model parameters of KoBERT, a Korean BERT developed by SK Telecom in 2019 (github.com/SKTBrain/KoBERT), using 1.5 million ADE narratives reported in KAERS.

3 Methods

3.1 Data Source

ADE narratives and structured drug safety information were obtained from 1.2 million ICSRs reported through KAERS between 1 January 2015 and 31 December 2019. We created documents for the extraction of drug safety information by concatenating five types of ADE narrative in an individual ICSR: disease history in detail, adverse event in detail, laboratory test in detail, reporter's opinion, and original reporter's opinion. Then, we removed documents originating from ADE narratives that were either too short, i.e., < 100 characters, or too long, i.e., > 740 characters, to control the reporting quality of ADE narratives and lighten the annotation burden. The 25th and 95th percentiles for the length of the documents were 100 and 740 characters, respectively. Additionally, we excluded documents where ADEs occurred during pregnancy because the data elements for ADEs experienced by pregnant women, infants, and children were not defined in our annotation system. Furthermore, we anonymized all the ADE narratives by replacing patient identification information including patient name, address, and hospital name with special tokens such as <NE-PERSON-NAME> or <NE-HOSP-NAME> using a rule-based de-identification algorithm that we developed.
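As a minimal illustration of this anonymization step, identifier replacement with special tokens can be sketched as follows; the function name, the dictionary-lookup strategy, and the single pattern rule are simplified assumptions for illustration, not the actual rules of our algorithm:

```python
import re

def deidentify(text, hospital_names, person_names):
    """Replace known identifiers with special tokens (a sketch of the
    rule-based de-identification described in the text; the matching
    rules here are illustrative assumptions)."""
    # Dictionary-based replacement for known names
    for name in hospital_names:
        text = text.replace(name, "<NE-HOSP-NAME>")
    for name in person_names:
        text = text.replace(name, "<NE-PERSON-NAME>")
    # Example pattern rule: mask strings that look like hospital mentions
    text = re.sub(r"\S+ hospital", "<NE-HOSP-NAME>", text, flags=re.IGNORECASE)
    return text
```

In practice a production de-identification pipeline would combine many such dictionary and pattern rules, tuned to the identifier formats observed in KAERS narratives.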

We selected ADE narratives for annotation in two ways. First, we randomly selected ADE narratives for annotation. Second, to diversify the type of documents, we selected additional ADE narratives from ICSRs that contained the least-reported items related to drug safety information, i.e., adverse events, indications, drug compounds, and drug products. The least-reported items were determined based on the KIDS-KAERS Database (KIDS-KD), a structured drug safety information database created by KIDS. To compare the clinical context between total and annotated ICSRs, we performed an exploratory data analysis on the report types and structured drug safety information of ICSRs using KIDS-KD.

3.2 Annotation of Adverse Drug Event (ADE) Narratives

We defined data elements for the extraction of drug safety information from ADE narratives based on the ICH E2B(R3) guideline [22]. Furthermore, the data elements that were rarely described in ADE narratives (e.g., reporter's name, whether an autopsy was done) and related to pregnancy (e.g., parent information, gestation period at the time of exposure) were excluded from the annotation. Then, we developed an annotation guideline that defined word entities, relations between entities, and entity labels to formulate the extraction of drug safety information as NLP tasks. The annotation guideline was reviewed by three pharmacoepidemiology experts to ensure that entities and relations were correctly defined. The annotation guideline was converted into a web-based annotation system using the tagtog service (tagtog.com).

In this study, we defined 21 types of word entities to capture comprehensive drug safety information in the ICH E2B(R3) guideline. These word entities were divided into six categories: clinical finding, drug, dosing information, date, patient information, and others. ‘ADE’ and ‘Disease’ entities belong to clinical finding along with ‘ADE Seriousness’ and ‘ADE at last observation.’ A clear distinction between ‘ADE’ and ‘Disease’ is the key component for extracting drug safety information, because mistaking ‘Disease’ for ‘ADE’ could undermine the safety of medical products. Thus, we recognized signs, symptoms, and diseases diagnosed after the administration of the concerned drug as ‘ADE,’ while those diagnosed before the administration of the concerned drug were seen as ‘Disease.’ In the drug and dosing information categories, we defined word entities for capturing drug names (i.e., ‘Drug compound,’ ‘Drug product,’ and ‘Drug group’) and words describing dosing information (e.g., ‘Dose’ and ‘Dosing Interval’). ‘Date’ and ‘Date Period’ entities, collectively classified as the date category, were defined to capture the temporal information of disease diagnosis, ADE occurrence, drug administration, and more. Word entities that help to assess the causality between drugs and ADEs including ‘Test name,’ ‘Test result,’ ‘Non-drug treatment’ and ‘Action taken with drug’ were put into the others category. ‘WHO-UMC (World Health Organization-Uppsala Monitoring Centre) assessment’ entity is the only sentence entity in our annotation guideline.

In addition, we defined 49 types of relations between word entities. For example, a relation between ‘Disease’ and ‘Drug Compound’ indicated that the drug compound was prescribed for the disease. In the annotation guideline, we provided clear instructions on how to annotate a relation between two entities. Furthermore, six entity labels were created to give detailed information on annotated entities. For example, we put an ‘occurred’ label to ‘ADE,’ ‘Disease,’ and three drug entities, i.e., ‘Drug compound,’ ‘Drug product,’ and ‘Drug group,’ to denote whether a mentioned entity actually occurred in a patient or was administered to the patient. Likewise, a ‘concerned’ label was added to three drug entities to indicate whether a mentioned drug entity was a suspected drug. Then, we performed an entity normalization for ‘ADE,’ ‘Disease,’ and three drug entities using MedDRA 24.0 (English and Korean) and the national drug code directory provided by the Ministry of Food and Drug Safety. Detailed explanations of word entities, relations, and entity labels in the annotation guideline can be found in the Online Supplementary Material (OSM) section 1.

3.3 Quality Control of Annotation

Five pharmacists who had experience in monitoring and reporting of ADEs in a pharmaceutical company or a pharmacy were recruited as annotators. They underwent a 1-week education program, through which they became familiar with the annotation guideline and performed preliminary annotations to understand how the annotation system works. Furthermore, confusing annotation examples were presented and explained to the annotators to help them understand the annotation principles in the guideline. Then, 80–120 documents a week were annotated by each annotator for 7 weeks. Ten percent of the documents were assigned to two different annotators at the same time in order to calculate the inter-annotator agreement. Around 4000 documents annotated by five annotators for the entire period were given dual annotations. In contrast, annotators performed an entity normalization for ADE and drug entities in only 20 documents a week to lighten their annotation burden.

To examine annotation quality, an independent reviewer (Siun Kim) separately reviewed 15% of the documents annotated by the five annotators. When annotation agreement between the independent reviewer and an individual annotator was < 80% of documents, the annotator was asked to re-do all of the documents assigned to him or her for that week. In addition, we inspected all the annotated documents in the form of a JSON file to check whether the annotators accurately followed the annotation guideline. When an annotator was found to have obviously violated the annotation guideline for a document (e.g., missing entity labels), we manually re-annotated the document. Annotation quality for NER was assessed using Cohen’s kappa [31], defined as follows:

$$\kappa =\frac{P_{\mathrm{o}}-P_{\mathrm{c}}}{1-P_{\mathrm{c}}},$$

where \(P_{\mathrm{o}}\) represents the proportion of annotations on which the two annotators agreed, and \(P_{\mathrm{c}}\) represents the proportion of agreement expected between the two annotators by chance.
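The statistic above can be computed directly from two annotators' label sequences; the following sketch estimates \(P_{\mathrm{c}}\) from each annotator's marginal label frequencies (the function name is ours, for illustration):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences,
    kappa = (P_o - P_c) / (1 - P_c)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement P_o: fraction of positions with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement P_c from each annotator's marginal label frequencies
    categories = set(labels_a) | set(labels_b)
    p_c = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_c) / (1 - p_c)
```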

3.4 Task Formulation

We formalized the NER task as a token-level sequence classification based on the BIO scheme [32]. Annotated tokens were tagged using the BIO scheme, where a token at the beginning of, inside, and outside an entity was labeled as ‘B,’ ‘I,’ and ‘O,’ respectively. We assigned NER tags based on word-level tokenization instead of WordPiece tokenization: only the first WordPiece token of each word was tagged and used in training NER models (Fig. S1, OSM), while the remaining WordPiece tokens, tagged as ‘X,’ were excluded from the loss function and performance metrics. We combined ‘Date end’ and ‘Date start’ entities into a ‘Date’ entity. Likewise, ‘Event admission’ and ‘Event discharge’ entities were combined into an ‘Event hospitalization’ entity, resulting in a total of 19 entity types for the NER task.
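The alignment of word-level BIO tags onto subword tokens can be sketched as follows; `tokenize` stands in for the KoBERT WordPiece tokenizer, and the function name is ours:

```python
def align_bio_to_wordpieces(words, word_tags, tokenize):
    """Project word-level BIO tags onto subword tokens: the first
    sub-token of each word keeps the word's tag, and continuation
    sub-tokens receive 'X' (excluded from loss and metrics),
    as described in the text."""
    pieces, piece_tags = [], []
    for word, tag in zip(words, word_tags):
        sub = tokenize(word)
        pieces.extend(sub)
        piece_tags.extend([tag] + ["X"] * (len(sub) - 1))
    return pieces, piece_tags
```

For example, with a toy tokenizer that splits every three characters, the word-level tags `["B-Drug", "B-ADE"]` for `["tramadol", "nausea"]` expand to `["B-Drug", "X", "X", "B-ADE", "X"]`.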

Sentence extraction for the ‘WHO-UMC assessment’ entity was formalized as a token-level sequence classification with a binary IO scheme. Tokens within and outside the entity were labeled as ‘I’ and ‘O,’ respectively. Thus, the objective was to train sentence extraction models to determine whether a token is located within a ‘WHO-UMC assessment’ sentence. In ADE narratives, ‘WHO-UMC assessment’ sentences often contained other word entities like ‘Drug product’ and ‘ADE.’ This resulted in tokens with multiple entity types in the training dataset, making the NER task a complex multi-target classification if ‘WHO-UMC assessment’ was included as a target entity type. To avoid this complexity, ‘WHO-UMC assessment’ was extracted through a separate sentence extraction task, distinct from the NER task.

We formalized relation extraction as a pair-wise binary classification for token pairs, for which entity types were defined as possibly related to each other in the annotation guideline. In our study, we defined a negative relation as an entity pair that the annotators did not identify as related, out of all the possible entity pairs that could be related based on their entity types. Conversely, a positive relation was defined as an entity pair that the annotators had identified as related. To illustrate, consider all the entity pairs of ‘Drug product’ and ‘Drug dose’ in an ADE narrative, which could potentially be related since a relation between these two entity types exists in the annotation guideline. Therefore, we used all entity pairs of ‘Drug product’ and ‘Drug dose’ in the ADE narrative for which no relation was identified by annotators as negative relations. Although we originally defined 49 types of relations to capture comprehensive drug safety information in the ADE narratives, we opted to utilize only 15 types of relations in order to improve the quality of an annotation dataset and balance the ratio between the annotated and negative relations in a dataset (Table S3, OSM). Also, we limited the maximum number of negative relations in a single ADE narrative up to 40 to balance positive (i.e., annotated) and negative relations in the training dataset. We constructed input embedding for pair-wise classification of relations by concatenating three token embeddings from BERT, including the first token of the two entities forming the relation and the first token embedding the entire ADE narrative.
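The input construction for the pair-wise classifier, i.e., concatenating the three token embeddings named above, can be sketched as follows (the function name is illustrative; in our models the embeddings come from the final BERT layer):

```python
import numpy as np

def relation_input(token_embeddings, head_idx, tail_idx):
    """Build the pair-wise relation classifier input by concatenating
    three token embeddings, as described in the text: the first token
    of each of the two entities forming the relation, and the first
    token of the ADE narrative (which embeds the whole sequence).
    token_embeddings: array of shape (seq_len, hidden)."""
    cls_vec = token_embeddings[0]          # first token of the narrative
    head_vec = token_embeddings[head_idx]  # first token of entity 1
    tail_vec = token_embeddings[tail_idx]  # first token of entity 2
    return np.concatenate([cls_vec, head_vec, tail_vec])  # (3 * hidden,)
```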

Label classification for ‘occurred’ and ‘concerned’ labels was formalized as a token-level sequence classification with three label types: (1) ‘positive,’ (2) ‘negative,’ and (3) ‘unrelated.’ We gave a token a ‘positive’ label when the token was annotated as ‘occurred’ or ‘concerned,’ while we considered a token ‘negative’ when the token was annotated as ‘not occurred’ or ‘not concerned.’ Word entities, for which neither ‘occurred’ nor ‘concerned’ label classification was applicable to the entity, were labeled as ‘unrelated.’ The input for label classification was the initial WordPiece token in every word, as was the case with the NER task.

3.5 Pretraining Korea Adverse Event Reporting System (KAERS)-Bidirectional Encoder Representations from Transformers (BERT)

Since ADE narratives collected through KAERS were written in Korean and contained a large number of medical terms and abbreviations, we developed a new domain-specific Korean BERT (KAERS-BERT) model to incorporate the semantic knowledge from ADE narratives in KAERS. In detail, we built KAERS-BERT by further pretraining KoBERT (github.com/SKTBrain/KoBERT) with masked language modeling on 1.2 million ADE narratives reported through KAERS. We only used ADE narratives in the ‘disease history in detail’ and ‘adverse event in detail’ fields to pretrain KAERS-BERT because ADE narratives in the ‘laboratory test in detail,’ ‘reporter's opinion,’ and ‘original reporter's opinion’ fields repeatedly described similar contents such as causality assessment of ADEs. The ADE narratives were tokenized using the KoBERT WordPiece tokenizer, and we masked 15% of tokens at random with a ‘[MASK]’ token. We used a maximum sequence length of 200 and a learning rate of \(5\times {10}^{-5}\).
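The masking step can be sketched as follows. This is a simplified illustration under our own naming (`mask_for_mlm` is not a library function): each token is masked independently with probability 0.15, and unmasked positions are excluded from the loss via the conventional ignore index of −100; the full BERT recipe additionally leaves some selected tokens unchanged or randomly replaced.

```python
import random

def mask_for_mlm(token_ids, mask_id, p=0.15, seed=0):
    """Randomly replace ~15% of tokens with the '[MASK]' id for masked
    language modeling, returning (inputs, labels); label -100 marks
    positions ignored by the loss."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < p:
            inputs.append(mask_id)
            labels.append(tid)      # model must predict the original token
        else:
            inputs.append(tid)
            labels.append(-100)     # position ignored by the loss
    return inputs, labels
```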

3.6 Model Development

We developed and trained four deep-learning baseline models for each of the four defined extraction tasks, i.e., NER, sentence extraction, relation extraction, and label classification (‘occurred’ and ‘concerned’): (1) Word2Vec [33] + long short-term memory (LSTM), (2) Word2Vec + bi-LSTM (bidirectional LSTM), (3) KoBERT, and (4) KAERS-BERT. In all of the settings, we used the KoBERT WordPiece tokenizer. The KoBERT model configurations were from Huggingface (huggingface.co/monologg/kobert). The LSTM and bi-LSTM models used hidden layers of 300 units. We used the Adam optimizer [34] with a learning rate of \(5\times {10}^{-5}\) to train all the baseline models, with drop-out probability and batch size set to 0.4 and 8, respectively.

We developed baseline models for the NER task by training BERT and RNN with CRF (conditional random fields) to capture the semantic dependency within a sentence [35]. We used a negative log-likelihood as a loss function of CRF for training. We predicted the tags of word entities to infer entity types of tokens by generating the tag sequence that maximizes log-likelihood via the Viterbi algorithm [36].
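The decoding step can be illustrated with a minimal Viterbi implementation over log-scores; this sketch uses only emission and transition scores (our illustrative simplification of a full CRF layer, which also learns start/end transitions):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the tag sequence maximizing the sum of emission and
    transition log-scores via the Viterbi algorithm.
    emissions: (seq_len, n_tags); transitions[i, j] = score of
    moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # cand[prev, cur] = best score so far ending in prev, plus transition
        cand = score[:, None] + transitions
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    # Backtrack from the best final tag
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]
```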

3.7 Model Evaluation

We split the annotated ADE narratives into training and test datasets with a proportion of 9:1. Subsequently, we obtained the validation dataset by performing another 9:1 random split from the training dataset constructed for each individual NLP task. We calculated precision, recall and F1-score to evaluate the information extraction performances of baseline models for NER, relation extraction, label prediction, and sentence extraction tasks. For NLP tasks that involved multi-class classification (i.e., NER and relation extraction), we reported two types of F1-scores to account for class imbalances in the test datasets: weighted F1-score and macro F1-score. The weighted F1-score was calculated by averaging the F1-scores for each class, weighted by the number of instances in each class, while the macro F1-score was computed as the average of the F1-scores for each class, with equal weighting. In all tasks, we reported precision and recall using the weighted method.
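The two averaging schemes can be made concrete with a small sketch (the function name is ours; classes appearing only in predictions are ignored here for simplicity):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus the macro (unweighted mean) and weighted
    (support-weighted mean) averages described in the text."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted
```

With a class imbalance, the weighted F1 is pulled toward the majority class's score while the macro F1 treats every class equally, which is why we report both.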

A unit sample for the NER task was a token assigned a NER tag other than ‘X,’ together with that tag, while a unit sample for the label classification was a token to which a label annotation (i.e., a ‘concerned’ or ‘occurred’ label) was applicable, together with its label. In the sentence extraction, all of the tokens in the dataset were used as input tokens, and it was predicted whether each token was positioned inside a ‘WHO-UMC assessment’ sentence. In the relation extraction, we used positive and negative relations and their binary modality as unit samples.

The proportions of predicted entity types were reported to investigate how well the KAERS-BERT model performed in classifying word entities. We observed that entities labeled as ‘not occurred’ were more likely to appear as a list of the most common ADEs of a concerned drug in ADE narratives, e.g., “Tramadol can cause nausea, vomiting, constipation, or drowsiness”, while entities labeled as ‘occurred’ were not. For this reason, we were concerned that the difference in the way entities were written in an ADE narrative between those labeled as ‘occurred’ and ‘not occurred’ might have an effect on the NER performances. Thus, we assessed the NER performances of the KAERS-BERT model separately for ‘occurred’ and ‘not occurred’ entities.

Lastly, to assess the utility of NLP models developed in this study for their intended purpose, we measured the improvement in data completeness of structured data fields in KIDS-KD by extracting drug safety information from ADE narratives. The assessment encompassed all ICSRs reported through KAERS between 1 January 2015 and 31 December 2019, totaling around 1.2 million, regardless of whether ADE narratives were collected. Nevertheless, we excluded certain data fields that require the combined results of NER and relation extraction for identification, such as the first date of drug intake, due to the insufficient performance of the relation extraction model. Thus, we evaluated the completeness of data fields where drug safety information can be extracted using NER results alone, such as ‘Drug compound’ and ‘ADE at last observation.’

3.8 Ablation Experiment

We performed an ablation experiment to investigate whether a baseline NER model improved when the training dataset contained more diverse ADE narratives. To this end, we created five training datasets, each initially consisting of either (1) 340 ADE narratives randomly selected from the total ICSRs, or 340 ADE narratives from ICSRs that contained one of the least-reported items, i.e., (2) ADEs, (3) indications, (4) drug compounds, and (5) drug products. First, we trained KAERS-BERT using the five training datasets of 340 ADE narratives (M = 0). Then, we added more randomly selected ADE narratives to each of the five training datasets at numbers (M) of 340, 1020, and 1700. Thus, training dataset 1 consisted of only 340 + M randomly selected ADE narratives (random only), while the other four datasets consisted of 340 ADE narratives with the least-reported items plus M randomly selected ADE narratives (ADE + random, indication + random, drug compound + random, and drug product + random, respectively). The performance of the baseline NER models was calculated as M was increased.

4 Results

4.1 Annotated Individual Case Safety Reports (ICSRs)

We annotated 3723 ADE narratives out of the 1,199,498 total ICSRs reported through KAERS between 1 January 2015 and 31 December 2019 (Table 1). The overall characteristics of ADE narratives were similar between the total and annotated ICSRs. A total of 235 (6.3%) ADE narratives were doubly annotated by two different annotators, and 580 (15.6%) by the independent reviewer and an annotator. The agreement was high not only between the annotators and the independent reviewer but also between any two annotators (Cohen’s kappa 96.5% and 85.9%, respectively; Table S1, OSM).

Table 1 Summary characteristics of the total and annotated individual case safety reports

4.2 Corpus Statistics

The annotated corpus contained a total of 86,750 entities extracted from 2378 randomly selected ADE narratives and 336, 336, 337, and 336 ADE narratives with the least reported ADEs, indications, drug compounds, and drug products, respectively (Table 2). All the entities defined in this study were annotated more than 300 times. The most frequent entity was ‘ADE’ (39.8%), while drug entities including ‘Drug compound,’ ‘Drug product,’ and ‘Drug group’ comprised 19.8% of the total annotated entities. The overall distributions of system organ class (SOC) were similar between the normalized ADE entities in annotated ADE narratives and the ADEs normalized by reporters in KIDS-KD (Fig. 2).

Table 2 Entities in annotated adverse drug event (ADE) narratives
Fig. 2

The Medical Dictionary for Regulatory Activities (MedDRA) system organ classes (SOCs) distribution of normalized ADE entities in annotated ADE narratives and adverse drug events (ADEs) normalized by reporters in Korea Institute of Drug Safety and Risk Management (KIDS) Korea Adverse Event Reporting System (KAERS) Database (KIDS-KD)

Furthermore, the annotated corpus contained a total of 81,828 entity labels and 45,107 relations related to the extraction of drug safety information (Tables S1 and S2, OSM). In total, 40.0% of ‘ADE,’ ‘Disease,’ and drug entities were labeled as ‘not occurred,’ while 17.1% of drug entities were labeled as ‘not concerned’ (Table S2, OSM). Among 49 relation types, 24 relations were used more than 500 times for annotation (Table S2, OSM).

4.3 Performance of Natural Language Processing (NLP) Models to Extract Drug Safety Information

The KAERS-BERT model outperformed all the other models, including the KoBERT model, in three NLP tasks to extract comprehensive drug safety information (Table 3). The F1-scores of the KAERS-BERT model were > 80% for the NER task and the label classification task for the ‘occurred’ entity label, and were 4.10–4.85 percentage points higher than those of the second-best performing KoBERT model (Table 3).

Table 3 Performance metrics (%) of baseline models by task

Even in cases where the KAERS-BERT model failed to correctly recognize entities, most of the errors were reasonably predictable (Table 5 and Table S4, OSM). For example, about half of the misclassified ‘Drug compound’ entities (20.21%) were recognized as ‘Drug group’ (10.24%), which is still closely related to ‘Drug compound’ (Table 5). Furthermore, the NER performances on entities labeled as ‘occurred’ were generally not lower than those on entities labeled as ‘not occurred,’ except for the ‘Drug compound’ entity (Table S5, OSM). For ‘Drug compound,’ the F1-score on ‘occurred’ entities was 5.08 percentage points lower than that on ‘not occurred’ entities.

Finally, utilizing the NER model to extract drug safety information from ADE narratives led to an average increase of 1.64% in data completeness for mandatory data fields, which were individually checked for completeness before submitting ICSRs in KAERS (Fig. 3). Additionally, there was a 6.43% improvement in data completeness for non-mandatory data fields that were not individually checked. Overall, the average improvement in data completeness across all data fields was 3.24%.

Fig. 3
figure 3

Comparison of completeness in a mandatory and b non-mandatory data fields structured in the Korea Institute of Drug Safety and Risk Management (KIDS) Korea Adverse Event Reporting System (KAERS) Database (KIDS-KD) before and after extracting drug safety information from adverse drug event (ADE) narratives using natural language processing (NLP) models

4.4 Ablation Experiment

NER performance was better when the training dataset combined randomly selected ADE narratives with narratives containing the least-reported items than when it consisted of randomly selected narratives alone (random only), particularly when the training dataset was sufficiently large (Fig. 4). When the KAERS-BERT model was trained on only 340 ADE narratives (M = 0), NER performance on total entities was better with the random only dataset (F1-score of 68.08%). However, for M ≥ 340, NER performance on total entities was consistently better with the drug product + random datasets than with the random only dataset (Fig. 4a). Likewise, at M = 1700, NER performance on ADE entities was better with the datasets containing least-reported-item narratives than with the random only dataset (F1-score of 65.70%), and was highest with the drug product + random dataset (F1-score of 68.13%).
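The compared training-set compositions can be sketched as follows; the narrative identifiers, pool sizes, and sampling seed are illustrative assumptions, not the study's actual sampling code:

```python
import random

def build_training_set(least_reported: list[str], random_pool: list[str],
                       m: int, seed: int = 0) -> list[str]:
    """Compose a 'least-reported item + random' training set:
    340 narratives reported with the least-reported item, plus
    M randomly selected narratives. The random only baseline is
    simply (340 + M) narratives sampled from the random pool."""
    rng = random.Random(seed)
    base = least_reported[:340]
    extra = rng.sample(random_pool, m)
    return base + extra

# Hypothetical narrative IDs standing in for KAERS narratives
least = [f"least_{i}" for i in range(340)]
pool = [f"rand_{i}" for i in range(5000)]

train = build_training_set(least, pool, m=1700)
print(len(train))  # → 2040
```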

Fig. 4
figure 4

Named entity recognition (NER) performances of the KAERS (Korea Adverse Event Reporting System)-BERT (bidirectional encoder representations from transformers) model on total entities (a) and adverse drug event (ADE) entities (b) by the composition of the training dataset. A random only dataset denotes a training dataset consisting of only (340 + M) randomly selected ADE narratives, while the ADE + random, indication + random, drug compound + random, and drug product + random datasets represent training datasets consisting of 340 ADE narratives reported with the least-reported ADE, indication, drug compound, and drug product items, respectively, plus M randomly selected ADE narratives

5 Discussion

We successfully created an annotated corpus for extracting comprehensive drug safety information from ADE narratives reported through SRS, and proposed well-performing baseline models for various NLP tasks. Furthermore, using 1.2 million ADE narratives, we pretrained the KAERS-BERT model, which is specialized for clinical texts with frequent code-switching between English and Korean. The KAERS-BERT model outperformed the KoBERT model on all NLP tasks (Table 3). All of this was possible because we adopted a thorough and consistent annotation guideline that defines drug safety information according to the standardized definitions of the data elements used in actual drug safety monitoring, as in the ICH E2B(R3) standard. As a result, annotation quality was appropriately maintained (Table S4, OSM).

While previous NLP corpora for extracting drug safety information have rarely attempted to capture information beyond ADE occurrence and drug dosing [9, 11, 12, 17, 18], our annotated corpus additionally covered other drug safety information useful for pharmacovigilance, including the WHO-UMC causality assessment results and temporal information. In addition, the NER performance of the KAERS-BERT model is comparable to, or even higher than, that of previous NLP models, even though this study used far more entity types than previous studies [9, 10, 17, 19]. The model extracting drug safety information from ADE reports in the VAERS [19] and the best-performing model of the MADE 2018 challenge [17], which detected adverse drug events in EHRs, achieved NER F1-scores of 67.35% and 82.9%, respectively, whereas the KAERS-BERT model achieved an F1-score of 83.91%. This is despite our study using 19 entity types, more than double and triple the numbers used in the MADE 2018 challenge and the VAERS study (9 and 6, respectively), a finer-grained labeling scheme that makes entity recognition harder.

The high performance of the KAERS-BERT model on the NER task may be attributable to three factors. First, the annotation guideline satisfactorily distinguished each word entity used to express drug safety information from the others (OSM section 1). Second, pretraining the BERT model on a large volume of ADE narratives likely helped tailor the model to the target domain of drug safety information extraction [37]. Last, ADE narratives mention drug safety information more explicitly than other clinical texts because the primary purpose of an ADE narrative is to describe an adverse event and the clinical setting of the patient experiencing it.

For example, drug safety information that helps determine causality between drug usage and ADE occurrence, such as whether a sign or symptom occurred before or after drug administration, is described more explicitly in ADE narratives than in other clinical texts. Indeed, the proportion of correctly predicted 'ADE' entities was much greater in this study (91.99%) than in the 2018 n2c2 shared task (58.7%), where ADEs and medication information were extracted from clinical notes [9]. This large difference in ADE recognition performance arises partly because ADE narratives point out ADEs more explicitly than other clinical texts do.

The KAERS-BERT model performed well on the NER task overall but had some difficulty in accurately identifying individual entities (Table 4 and Table S4, OSM). However, most misidentifications occurred between similar word entities, such as between 'Drug compound' and 'Drug group' or between 'Dose' and 'Dosing interval,' so we anticipate that simple rule-based post-processing would be sufficient to correct them. Therefore, the final performance of an end-to-end system that extracts drug safety information from free text and incorporates it into the SRS database could be even higher than the NER performance of the NLP models alone. Additionally, we were initially concerned that NER performance on entities labeled as 'occurred' would be lower than on those labeled as 'not occurred,' because 'not occurred' entities tend to be listed in similar ways in ADE narratives, for example as the most common ADEs associated with a drug of concern. However, contrary to our concern, the NER performance on 'occurred' entities was comparable to that on 'not occurred' entities (Table S5, OSM).
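As an illustration of the kind of rule-based post-processing we envision, a toy lexicon lookup could relabel 'Drug group' predictions whose surface form is actually a known compound; the lexicon entries and function below are hypothetical, not part of our pipeline:

```python
# Hypothetical lexicon of known drug compound names; a production rule
# would draw on a curated terminology rather than a hard-coded set.
COMPOUND_LEXICON = {"amoxicillin", "metformin", "warfarin"}

def postprocess(entity_text: str, predicted_type: str) -> str:
    """Toy rule: relabel a 'Drug group' prediction as 'Drug compound'
    when the recognized text matches the compound lexicon."""
    if predicted_type == "Drug group" and entity_text.lower() in COMPOUND_LEXICON:
        return "Drug compound"
    return predicted_type

print(postprocess("Metformin", "Drug group"))  # → Drug compound
print(postprocess("NSAIDs", "Drug group"))     # → Drug group
```

Analogous rules, e.g., checking for a time unit attached to a number, could separate 'Dose' from 'Dosing interval' confusions.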

Table 4 Entity recognition for 12 key word entities by the KAERS-BERT model

Furthermore, we showed that NER performance improved when ADE narratives with the least-reported items were added to the training datasets (Fig. 4). We postulate that the more diverse clinical settings and ADEs introduced by these narratives improved model performance. Model performance improved as the training dataset grew, and the improvement was larger for training datasets containing ADE narratives with the least-reported items than for the random only dataset. As we hypothesized, this finding indicates that the diverse clinical settings in ADE narratives with the least-reported items could improve model performance once the model had become familiar with the dominant clinical settings represented in randomly selected ADE narratives. Conversely, for the smallest training datasets, NER performance was better when using the random only dataset (Fig. 4). This finding was not entirely unexpected, in that the test dataset was also composed of randomly selected ADE narratives.

We expect that the baseline models we developed can improve the data quality of SRS by capturing drug safety information that is left out when ICSRs are transmitted to the SRS database. For example, we observed that the proportion of entities labeled as 'Caused hospitalization' among 'ADE at the last observation' was greater in the total ICSRs than in the annotated ADE narratives (Table S1, OSM). Also, 'ADE' entities normalized to psychiatric and immune system disorders were more frequent in the annotated ADE narratives than in KIDS-KD (Fig. 2). These differences may indicate that certain ADEs and clinical settings, for example psychiatric or immunologic ADEs and hospitalization caused by ADEs, tend to be left out when transmitting ICSRs to the SRS database. Because the annotated corpus appropriately captured this information, the NLP models we developed should be capable of extracting drug safety information that would otherwise remain untapped.

Our study had several limitations. First, we did not annotate parent-child drug safety information, which is critical for evaluating drug safety during pregnancy. Second, we did not provide baseline models for the entity normalization task or for the classification tasks of entity labels other than 'occurred' and 'concerned,' although annotation for these tasks was completed. Third, the NLP model struggled to accurately recognize certain word entities in ADE narratives, such as 'Date' and 'Test result,' with an accuracy of < 70% (Table 4). Last, the high performance of the KAERS-BERT model does not guarantee that the model improves the data quality of drug safety information collected through SRS. The usefulness of the NLP model in extracting drug safety information depends on the post-processing module that enters that information into the SRS database based on the model's inference results. To address these limitations, an end-to-end system that extracts drug safety information from ADE narratives and incorporates it into the SRS database could demonstrate the utility of the NLP models in a real-world pharmacovigilance setting [8].

6 Conclusion

In conclusion, we defined the extraction of comprehensive drug safety information from ADE narratives reported through KAERS as a series of NLP tasks and successfully developed well-performing baseline NLP models for those tasks. Specifically, we developed the KAERS-BERT model, suited to clinical texts written in Korean and English, using 1.2 million ADE narratives collected through KAERS. When we used the KAERS-BERT model to extract comprehensive drug safety information from ADE narratives, we observed an average increase of 3.24% in data completeness for data fields structured in KIDS-KD. The annotated corpus and the KAERS-BERT model can streamline pharmacovigilance activities and eventually increase their efficiency by improving the data quality of an SRS database.