Key Points

Narrative clinical notes in electronic health records are frequently the only documentation that an adverse drug event (ADE) has occurred.

Natural language processing (NLP) can be employed to identify mentions of drugs and symptoms, facilitating the detection of ADEs in clinical text.

While this is still an active area of research, progress is being made in improving NLP methods for ADE mention detection using advanced algorithms.

1 Introduction

Pharmacovigilance is a broad spectrum of activities that focus on identifying and preventing adverse drug events (ADEs), as well as understanding the risk factors and causes of ADEs when they do occur [2]. Relying on ADEs formally and spontaneously reported to the Food and Drug Administration (FDA) will inevitably lead to underestimating the risks imposed by medications [3]. In an effort to improve patient safety and mitigate risks, healthcare organizations have started implementing automated ADE detection systems in electronic health records (EHRs) [4]. It has long been recognized that clinical narrative is the best source of information about suspected medication-related events [5]. While structured data in EHRs typically contain prescription and fill information for medications, as well as coded diagnoses, clinical narratives often describe the relationships between these concepts, such as a medicine prescribed to treat a condition, or a side effect or ADE that may have occurred because of treatment. A wide variety of natural language processing (NLP) approaches have previously been explored to discover relationships between drugs and symptoms in EHRs, as well as to learn about potential risks from the biomedical literature [6, 7]. Despite progress in the development of text-processing techniques, clinical narrative continues to be an underutilized source of data for identifying unreported ADEs. Language variability and local environmental differences between clinical settings limit the adoption of NLP solutions across organizational boundaries [8].

Powerful machine learning algorithms based on deep learning, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that use pre-trained word embeddings [9, 10], have shown a strong ability to capture complex relationships between concepts in text without laborious feature engineering [11]. RNN models have been accepted as the current state-of-the-art approach to labeling sequential data. While often high performing, training an RNN is a computationally intensive process that takes time and may require specialized hardware such as a graphics processing unit (GPU) [12]. Healthcare organizations as well as clinical research teams frequently lack the computational infrastructure needed to implement the latest text-processing techniques, thus limiting their adoption [13]. Therefore, despite great advances in the availability of high-performance computing infrastructure, it is essential to develop NLP systems that are fast, accurate, easily trainable in a new domain, and do not require specialized hardware.

We present a system, submitted to the NLP Challenges for Detecting Medication and Adverse Drug Events from Electronic Health Records (MADE 1.0), that automatically identifies ADEs explicitly stated in clinical narratives as well as other information about patient drug treatments, and we describe our methods for all three tasks of the challenge [1, 14].

2 Methods

2.1 Data

A research team at the University of Massachusetts (UMASS) Medical School organized a shared task to tackle the problem of accurately detecting ADEs in clinical narrative. A detailed description of the shared task is presented elsewhere [14]. The shared task organizers prepared a set of de-identified clinical notes from the UMASS hospital and manually annotated them with the following categories:

  • Drug, defined as any mention of a medication, including brand and generic names, as well as frequently used abbreviations.

  • Indication, defined as a symptom that is a reason for drug administration.

  • Adverse Drug Event (ADE), defined as a sign or symptom that resulted from a drug.

  • Other Signs, Symptoms and Diseases (SSLIF), defined as any other sign or symptom that is not directly related to any drug mentioned in the note.

  • Drug Frequency, defined as prescribed or suggested frequency of drug administration, such as ‘once per day’, or ‘as needed’.

  • Drug Dose, or Dosage, defined as the amount of drug administered at one time.

  • Drug Duration, defined as the length of time of a single prescription episode, such as ‘for 10 days’, or ‘for 2 weeks’.

  • Drug Route, defined as the mode of administering the medication, such as ‘oral’, or ‘intravenous’.

  • Severity, defined as the extent the disease or symptom affects the patient, such as ‘some’, or ‘severe’.

The annotated set also included relationships between different concepts (drugs, ADEs, indications, and signs and symptoms) that linked drug names to drug attributes (dose, route, frequency, and duration), drug names to ADEs, drug names to indications, and symptoms to severity.

The data set was split into two parts for training and testing of NLP systems and the sets were distributed to the participating teams at different times (see Table 1).

Table 1 Concept instance distribution in training and testing sets

2.2 System Design

In accordance with a well-established approach, our NLP system has two main modules: (1) identifying mentions of drugs, drug attributes, and symptoms as they appear in clinical notes; and (2) classifying relationships between concepts with a set of predefined labels [6, 15].

2.2.1 Named Entity Recognition (NER)

Named Entity Recognition (NER) is a fundamental task in NLP that focuses on discovering mentions of a limited set of concept types. The traditional approach to the task is to use a predefined dictionary and expert-driven syntactic and semantic rules [16]. In the absence of a comprehensive dictionary for broad categories, statistical and supervised learning methods have been widely employed. Sequence-based classifier algorithms allow contextual information to be incorporated into the classification model. Deep learning algorithms for sequence-based classification are becoming increasingly popular for clinical NER because they alleviate the need for manual feature selection [17]. The main limitation of using neural networks is that model optimization typically involves hundreds or thousands of training iterations for hyperparameter search and cross-validation. While the computational intensity of deep learning algorithms is widely recognized, the challenges of working with such algorithms are frequently dismissed by citing the availability of specialized hardware such as GPUs. In practice, while GPU acceleration aids in training neural network models, such hardware may not be available in all development and deployment environments. As Domingos writes, “machine learning is not a one shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating” [18]. When deciding on an appropriate algorithm, system designers must balance expert time against computational time. Owing to timing and resource constraints, we chose a less computationally intensive algorithm and focused on feature engineering. The conditional random field (CRF) is a supervised machine learning classification algorithm that is simpler, and thus faster to train, than an RNN, yet has the potential to perform well with minimal feature engineering [19]. The purpose of the constructed NER module was to label each token identified in clinical documents with one of the categories listed in Sect. 2.1.

Feature engineering is often a major step in building machine learning applications. One approach to representing words in text as numeric vectors is word embeddings. The main benefit of word embeddings is that a model can be created from a large set of unlabeled documents and then reused for a variety of use cases [20]. For our system we used two sets of pretrained word embeddings as the basis for features. One set was trained with the continuous bag-of-words architecture on public sources and nearly 100,000 EHR notes [21, 22]. The other set was trained with the skip-gram architecture without any EHR data [23]. These sets are referred to as the EHR and NoEHR embeddings in our system design description.

Following a previously described approach [20, 24,25,26], we included word embeddings as cluster features rather than as continuous values. The word embedding vocabulary contained over 5 million entries; therefore, we trained clusters with mini-batch K-means to work within available memory [27]. In addition, we included multiple cluster sizes (K = 500, 5000, and 10,000) and compound cluster features formed from token bigrams (e.g., “Cluster17_Cluster22”) to capture generalizable phrases as opposed to strict bigrams, as suggested by Guo et al. [20].
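A minimal sketch of this clustering step is shown below, assuming the pretrained vectors have already been loaded into a vocabulary list and a matrix; all function and feature names are illustrative rather than the system's actual implementation.

```python
# Sketch: derive cluster-ID features from pretrained word embeddings.
# Assumes vocab (list of words) and vectors (ndarray, shape [len(vocab), dim])
# were loaded from a pretrained embedding file; loading code omitted.
from sklearn.cluster import MiniBatchKMeans

def build_cluster_lookups(vocab, vectors, sizes=(500, 5000, 10000)):
    """Map each vocabulary word to a cluster ID, once per cluster size K."""
    lookups = {}
    for k in sizes:
        km = MiniBatchKMeans(n_clusters=k, batch_size=10000, random_state=0)
        labels = km.fit_predict(vectors)
        lookups[k] = dict(zip(vocab, labels))
    return lookups

def cluster_features(tokens, i, lookups):
    """Unigram cluster IDs plus compound bigram features such as
    'Cluster17_Cluster22' for the current and next token."""
    feats = {}
    for k, table in lookups.items():
        cur = table.get(tokens[i].lower())
        if cur is None:
            continue
        feats[f'cluster{k}'] = f'Cluster{cur}'
        if i + 1 < len(tokens):
            nxt = table.get(tokens[i + 1].lower())
            if nxt is not None:
                feats[f'cluster{k}_bigram'] = f'Cluster{cur}_Cluster{nxt}'
    return feats
```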

To identify known medications, the system included a lexicon of drug names built from MedEx resources [28]. This lexicon was then used for term matching both within local windows and across the entire sentence context.
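The sketch below shows one plausible way to expose such lexicon matches as features, covering both the current token and the sentence-level context that appears in the feature list later in this section; the simple case-insensitive lookup is an assumption, and the actual MedEx-derived matching rules may be more elaborate.

```python
# Sketch: lexicon-match features (hypothetical helper; the real lexicon and
# matching rules derived from MedEx resources may differ).
def drug_lexicon_features(tokens, i, drug_lexicon):
    feats = {'drug_lex': tokens[i].lower() in drug_lexicon}
    # Sentence-level context: any known drug to the left or right of the token?
    feats['drug_left'] = any(t.lower() in drug_lexicon for t in tokens[:i])
    feats['drug_right'] = any(t.lower() in drug_lexicon for t in tokens[i + 1:])
    return feats
```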

The NER processing consisted of the following steps (a code sketch follows the list):

  1. Sentence splitting using both a limited set of custom regular expressions and the Natural Language Toolkit (NLTK) [29].

  2. Tokenization and part-of-speech (POS) labeling using NLTK.

  3. Detecting mentions of known drug names using a lexicon developed from MedEx resources.

  4. Building feature vectors from a variety of features.
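A condensed sketch of steps 1 and 2 with NLTK follows; our custom regular-expression sentence rules are omitted, and the specific stemmer shown is an assumption.

```python
# Sketch: sentence splitting, tokenization, POS tagging, and stemming with NLTK.
# The custom regular-expression sentence rules used in the system are omitted.
import nltk
from nltk.stem.porter import PorterStemmer

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

stemmer = PorterStemmer()  # assumption: the exact stemmer is not specified

def preprocess(text):
    for sent in nltk.sent_tokenize(text):                     # step 1
        tokens = nltk.word_tokenize(sent)                     # step 2
        pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]   # step 2
        stems = [stemmer.stem(t.lower()) for t in tokens]
        yield tokens, pos_tags, stems
```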

The features included in the final model are as follows (assembled into per-token feature dictionaries in the sketch after this list):

  • Local features (window = 2 tokens):

    • Token, stem, POS tag

    • Patterns of capitalization, digits, and punctuation

    • Prefix and suffix characters (n = 2, 3)

    • Embedding clusters from unigrams and bigrams

    • Drug lexicon match

  • Sentence features:

    • Drug lexicon match to the left or right of the current word
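Below is a sketch of how these features could be assembled into the per-token dictionaries that sklearn-crfsuite consumes, reusing the helpers sketched above; feature names and the shape pattern are illustrative.

```python
# Sketch: per-token feature dictionary combining the listed feature groups.
import re

def token_shape(tok):
    """Coarse pattern of capitalization, digits, and punctuation."""
    return re.sub(r'[A-Z]', 'A', re.sub(r'[a-z]', 'a', re.sub(r'\d', '0', tok)))

def token_features(tokens, pos_tags, stems, i, lookups, drug_lexicon):
    feats = {}
    # Local features within a window of +/-2 tokens.
    for off in range(-2, 3):
        j = i + off
        if 0 <= j < len(tokens):
            p = f'{off}:'
            feats[p + 'token'] = tokens[j].lower()
            feats[p + 'stem'] = stems[j]
            feats[p + 'pos'] = pos_tags[j]
            feats[p + 'shape'] = token_shape(tokens[j])
            feats[p + 'prefix2'], feats[p + 'prefix3'] = tokens[j][:2], tokens[j][:3]
            feats[p + 'suffix2'], feats[p + 'suffix3'] = tokens[j][-2:], tokens[j][-3:]
    # Embedding clusters and lexicon matches from the earlier sketches.
    feats.update(cluster_features(tokens, i, lookups))
    feats.update(drug_lexicon_features(tokens, i, drug_lexicon))
    return feats
```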

We utilized an annotated set of 876 clinical notes provided by the MADE 1.0 organizers to train a CRF model for the NER module of the ADE detection system. The CRF model was trained using CRFSuite via the sklearn-crfsuite package, which is compatible with scikit-learn [30, 31].
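A minimal training sketch with sklearn-crfsuite follows; the L-BFGS optimizer matches our submission (see Sect. 3.1), but the toy data and regularization values are purely illustrative.

```python
# Sketch: CRF training with sklearn-crfsuite. Toy data and hyperparameters
# are illustrative, not the tuned settings of the submitted system.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Each sentence is a list of per-token feature dicts with parallel BIO labels.
X_train = [[{'0:token': 'lisinopril'}, {'0:token': 'cough'}]]  # toy example
y_train = [['B-Drug', 'B-ADE']]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',           # the optimizer used in our submission
    c1=0.1, c2=0.1,              # L1/L2 regularization (illustrative values)
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_train)
print(metrics.flat_f1_score(y_train, y_pred, average='micro'))
```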

2.2.2 Relation Extraction (RE)

Once entities are detected in clinical documents, related entities must be linked to represent the connections between them. For our system, relationships had to be identified between drug names and drug attributes (duration, route, frequency, and dosage); between drug names and symptoms that they caused, i.e., ADEs (labeled as Adverse), or that were the reason for prescription, i.e., indications (labeled as Reason); and between symptoms and severity concepts (labeled as Severity). Building the RE module was treated as a traditional supervised classification problem. We utilized features suggested in previous work [32, 33]. Specifically, we extracted three types of features (a code sketch follows the list):

  • Candidate Entities: Information about pairs of entities being considered for a relation:

    • Entity types

    • Entity word forms

  • Entities Between: Other entities that appear between candidates

    • Entity types

    • Number of entities

  • Surface Features: Tokens and POS tags between and neighboring the candidate entities

    • N-grams (n = 1–3)

    • Window size (1–3)

    • Number of tokens.
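The sketch below assembles these three feature groups for one candidate pair, assuming entities are dictionaries carrying a type, surface text, and token offsets, with e1 preceding e2; neighboring-token window features are omitted for brevity.

```python
# Sketch: features for one candidate entity pair (structure is illustrative;
# assumes e1 precedes e2 and entities carry type, text, and token offsets).
def relation_features(sent_tokens, e1, e2, between_entities):
    between = sent_tokens[e1['end']:e2['start']]   # tokens between the pair
    feats = {
        # Candidate entities.
        'e1_type': e1['type'], 'e2_type': e2['type'],
        'e1_text': e1['text'].lower(), 'e2_text': e2['text'].lower(),
        # Entities between the candidates.
        'between_entity_types': '_'.join(e['type'] for e in between_entities),
        'n_between_entities': len(between_entities),
        # Surface features.
        'n_tokens_between': len(between),
    }
    for n in (1, 2, 3):                            # n-grams between candidates
        for i in range(len(between) - n + 1):
            feats['gram_' + '_'.join(between[i:i + n]).lower()] = True
    return feats
```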

We divided RE into two subtasks: first, relation detection, a binary classification of whether any relation exists between two entities; and second, relation classification, in which we determine the specific relation type [34]. Using a binary model for the first subtask removes many false relations and improves classification precision. The multi-class classifier used in the second subtask is applied to all candidate pairs predicted to have a relation. Both classifiers are random forest models implemented in scikit-learn [31].
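An end-to-end sketch of this two-stage design with scikit-learn random forests is shown below; the toy feature dictionaries, labels, and n_estimators setting are illustrative.

```python
# Sketch: two-stage relation extraction with random forests. The toy feature
# dicts stand in for real candidate pairs built by relation_features above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

feats = [{'e1_type': 'Drug', 'e2_type': 'ADE', 'n_tokens_between': 2},
         {'e1_type': 'Drug', 'e2_type': 'SSLIF', 'n_tokens_between': 9}]
has_rel = np.array([1, 0])            # stage-1 binary labels (toy)
rel_type = np.array(['Adverse'])      # stage-2 labels for positive pairs (toy)

vec = DictVectorizer()
X = vec.fit_transform(feats)

# Stage 1: does any relation exist between the candidate pair?
detector = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, has_rel)
# Stage 2: which relation type, trained only on pairs with a relation.
classifier = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X[has_rel.astype(bool)], rel_type)

# At prediction time, stage 2 labels only the pairs stage 1 accepts.
keep = detector.predict(X).astype(bool)
print(classifier.predict(X[keep]))
```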

2.2.3 Full System

The integrated system combined NER and RE into a single pipeline with no additional processing: source text is processed by the NER system, which prepares documents in BioC format [35], and the RE system then augments these documents with predicted relations.

3 Results

The NLP system was validated against the evaluation set provided by the MADE 1.0 challenge, and our system's performance was compared with that of the other submitted systems. The MADE 1.0 challenge distinguished between ‘standard’ and ‘extended’ resources employed by the designed systems. Standard resources included only the data resources provided by the challenge organizers, which were the EHR-trained word embeddings. Any other resources used in a system's design were considered extended. The final challenge results were initially reported on standard resources only; however, we also share findings obtained when additional resources were used (e.g., the NoEHR embeddings and MedEx) to illustrate how these resources improved performance.

The challenge was organized as three tasks: (1) NER, (2) RE, and (3) the full system. The evaluation set contained 213 annotated documents that were used to obtain the validation results. The performance of each model was evaluated separately and then in combination on the evaluation set. The micro-averaged F1 score was 80.9% for NER, 88.1% for RE, and 61.2% for the final system. During development and system training, a hold-out set containing 20% of the training data was used to evaluate feature contributions for the RE model and to perform detailed error analysis for the two main modules. The full evaluation set was used to evaluate the NER model feature contributions and for NER error analysis.

3.1 Named Entity Recognition (NER) Results

Overall and per-label performance for our optimal NER model is presented in Table 2, while Table 3 summarizes the contribution of each feature class in the NER model. Performance was lowest on the ADE and Indication labels, where recall was much lower than for the other classes.

Table 2 Performance metrics of the CRF NER model on the 213 final evaluation documents
Table 3 Contribution of NER model features by strict (exact text match) micro-averaged metrics

Besides optimizing for F1, one of our objectives in using a CRF model was to allow rapid feature development and short training times. Wall-clock time on a CPU for extracting features from over 800 documents was measured at 2.5 min. In the system submitted to the MADE challenge, the optimizer for the training algorithm was L-BFGS [36].

Following the challenge, we corresponded with some of the top-performing teams on the NER task. Since many of them used some form of RNN, we wanted to compare the time required to train our respective models. A comparison of training time between our CRF model and the top-performing system is shown in Table 4.

Table 4 Comparison of training time between our system and the top performing submission in the NER task

While the top-performing team reported that one training fold of their model required approximately 4 h on a GPU, we make a rough comparison by estimating the range of CPU training time from reported figures of a 2× to 15× increase in training time [12, 38].

3.2 RE Results

Per-label performance of the RE model on the final 213 evaluation documents is shown in Table 5. Performance was lowest on ‘Adverse’ and ‘Reason’. We performed additional analysis using an initial hold-out set of 176 documents from the training set. The contribution of each feature set is shown in Table 6.

Table 5 Performance metrics of the relation extraction model on the final 213 evaluation documents
Table 6 Contribution of features for the relation extraction model using a hold-out set of 176 documents

3.3 Integrated System Results

The results for the final system are shown in Table 7.

Table 7 Micro-averaged performance metrics of the final integrated model on the final 213 evaluation documents

3.4 Error Analysis

As the overall micro-averaged F1 score of the NER model was broadly similar to that of other submissions, an error analysis was performed on the false negatives and false positives for the ADE and Indication labels to categorize the incorrect predictions. The categories of errors we identified, ordered from most common, are listed in Table 8. Table 9 outlines the main categories of errors found when evaluating the accuracy of relationship classification.

Table 8 Error analysis from NER predictions related to ADE and indication labels
Table 9 Error analysis on relation extraction errors from a hold-out set of 176 documents

4 Discussion

Table 10 shows our final F1 scores on the 213 evaluation documents as reported by the MADE 1.0 organizers using standard resources only. Table 11 shows the scores of our original test submissions alongside the three highest-performing submissions. Our system was ranked first in Task 2 and third in Task 3. Our scores have improved in each task since the test submission. The score for the NER system was improved by incorporating extended resources that were not considered in the reporting of top submissions. The score for the RE system was improved by fixing an error in the sampling techniques used during training.

Table 10 Final evaluation scores for each task
Table 11 F1 scores reported by the MADE 1.0 organizers of the original test submissions

A useful contribution of our approach to building an NER model is that it can be trained relatively quickly on commonly available hardware compared with neural network approaches. Given that the training times in Table 4 reflect a single training fold, model optimization is clearly a highly computationally intensive task. Completing model optimization requires either long-running processes on a single compute node or resources such as compute clusters or cloud computing services. Our findings suggest that employing deep learning techniques can be prohibitively expensive for a smaller research team with a short deadline. Using rapid feature engineering and training, we were able to quickly evaluate whether our development efforts were successful.

The feature contribution analysis shows that the NER model benefited from feature engineering, including the use of a drug lexicon. Additionally, embedding cluster features improved performance, with optimal performance achieved by employing both sets of pretrained embeddings, even though one embedding set did not include EHR documents in its training corpus.

The RE system performed best on categories such as ‘Route’, ‘Frequency’, and ‘Dose’, which are relatively simple statements that connect two entities often in close proximity in the text. The more challenging categories, such as ‘Reason’ and ‘Adverse’, are often more linguistically complex and may require inference to understand that the two entities involved are connected. These categories would benefit from a more thorough analysis.

Feature engineering was an important component of the RE system. Of the three base feature sets that we considered, the surface features were by far the highest performing on the hold-out validation set. Although using only information about the entities being considered had a fairly low performance, adding information about what kinds of entities occur between them boosted performance considerably and resulted in a fairly competitive score. Using the union of all three resulted in the highest score.

The final integrated system combined both the NER and RE systems. Its performance was significantly lower than that of the RE system operating on gold-standard annotated documents, which illustrates the difficulty of the integrated task.

Despite competing against more powerful and more computationally intensive approaches implemented by other submitted systems, our system achieved comparable results. The RE model was the top performing model in its task and the final system placed among the top three submissions.

4.1 Limitations

One limitation of the NER system is that the model assigns the SSLIF, ADE, and Indication labels in a single phase. Since these labels all denote signs and symptoms that differ only in their causal relationship to drugs, one possible improvement would be a two-stage process. The first stage would collapse SSLIF/ADE/Indication into one label, and a second stage would then disambiguate among them. Since the context window of the current CRF implementation is limited, the second stage could be a separate classifier that uses much more context than a single sentence to determine the most appropriate label. Features for this phase could include the current set, but the features used in the RE system could also provide benefit. It would be interesting to see how such a staged architecture would perform compared with RNN models, which were the top performers in the challenge. Since our feature engineering remained relatively minimal, the CRF model would also likely benefit from additional ADE-related features described in previous work [32]. One final limitation of the NER system, observed during error analysis, was that the current sentence detection algorithm was imperfect and often divided documents into sentences that were too small. Improved sentence breaking might particularly improve performance on the ADE and Indication labels, as the current implementation limited the available context to a single sentence, which was often far too short.

Another limitation of the integrated system was that we did not adjust either of the components (NER or RE) when combining them. Future work could focus on additional processing to improve the results when the two systems are used together.

Finally, since this challenge was conducted on a set of notes from oncology patients, it is unclear how well these models might generalize for pharmacovigilance in other medical domains. In future work, we intend to evaluate these models in the Department of Veterans Affairs to determine how well this work may translate to improving outcomes.

5 Conclusion

Automatic detection of adverse drug events can potentially have a profound effect on patient safety and accurate drug risk assessment. We developed a natural language processing system that can be retrained and applied in a new clinical setting without the use of specialized hardware, while still achieving performance comparable to that of more computationally intensive algorithms and without requiring extensive feature engineering. Future work will include additional features and testing the system on a new dataset and in a new environment.

Source code for the NER system, including feature extraction methods, is available at https://github.com/burgersmoke/MADE-CRF.