Large-scale drug safety surveillance and pharmacovigilance are key components of effective drug regulation systems, clinical practice, and public health programs [1]. Although the efficacy and safety of a drug must be demonstrated in a series of clinical trials prior to approval [2], many adverse drug events (ADEs) are detected only after a drug has been marketed, when it is used by a larger and more diverse population than during clinical trials. Adverse drug events discovered after a drug is in broad use can be a significant cause of morbidity and mortality. Thus, effective and accurate post-market drug surveillance is urgently needed to protect public health and to reduce healthcare expenditures due to ADE-related hospital complications [3,4,5].

Spontaneous reporting systems [6,7,8,9] have traditionally been used for pharmacovigilance. However, this type of data is inherently passive: except for drug companies’ spontaneous reporting systems, reporting is voluntary, and studies have shown that as many as 90% of serious ADEs go unreported [10]. Electronic health records (EHRs) contain real-time, real-world clinical data gathered during routine clinical care, offering a potentially more proactive approach to pharmacovigilance [2]. EHRs therefore play an important role in post-market surveillance under the new paradigm of drug regulation [11]. More importantly, compared with the structured or coded data in EHRs, unstructured clinical narratives document ADEs more completely. One study showed that only 9020 (28.6%) of 31,531 patients with documented statin side effects had the relevant ADE recorded in a structured format [12]. Therefore, developing advanced natural language processing (NLP) techniques to unlock ADE information from EHR narratives will greatly facilitate proactive, accurate, and efficient post-market drug safety monitoring on a large scale.

In 2010, i2b2 partnered with the VA Salt Lake City Health Care System and organized an open NLP challenge [13], which supported community efforts in applying NLP to extract medications, their treatment targets, and the adverse events they caused from EHR narratives. However, the annotation schema defined for that challenge covers only a limited set of entities relevant to pharmacovigilance. To better assess the current methodological progress in this research area, we organized the “NLP Challenge for Detecting Medication and Adverse Drug Events from Electronic Health Records (MADE1.0)” in 2018, which offers a larger-scale set of expert-annotated clinical notes labeled with more fine-grained clinical named entities and relations related to drug safety surveillance. Fifteen teams from seven countries registered for the challenge, and 11 teams submitted a total of 41 runs.

Part of this theme issue of Drug Safety presents recent advances in mining unstructured information from clinical narratives in the context of drug safety surveillance and pharmacovigilance. It includes five articles from the MADE1.0 challenge: an overview paper and four research papers invited from top-performing teams that participated in the challenge.

The first paper, by Jagannatha et al. [14], provides an overview of the MADE1.0 challenge. First, the article describes the MADE1.0 corpus, including details of the annotation process and a comprehensive annotation schema. The authors report Fleiss’ kappa scores of 0.628 and 0.424 for inter-annotator agreement on named entity annotation and relation annotation, respectively. Second, the authors introduce the three subtasks defined in the challenge: Named Entity Recognition (NER), Relation Identification (RI), and Joint Relation Extraction (NER-RI), followed by a comprehensive report of the system submissions for the challenge. Finally, an ensemble-based aggregation of systems showed improved performance, suggesting that the top-performing systems were developed in different but complementary ways.
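For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of the standard Fleiss' kappa computation. It is not the challenge organizers' code; the function name and the input layout (one row per annotated item, one column per category, each cell holding the number of annotators who chose that category) are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Assumes every item was rated by the same number of annotators (>= 2).
    """
    if not counts:
        raise ValueError("at least one item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: mean, over items, of the fraction of annotator
    # pairs that agree on that item.
    pair_total = n_raters * (n_raters - 1)
    p_observed = sum(
        (sum(c * c for c in row) - n_raters) / pair_total for row in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (invented data): 4 items, 3 annotators, 2 categories.
toy_table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy_table), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is the scale on which the reported scores of 0.628 and 0.424 should be read.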

Wunnava et al. [15] present a three-layer deep learning architecture for the NER subtask, consisting of a bi-directional long short-term memory (BiLSTM) layer for character-level encoding, a BiLSTM layer for word-level encoding, and a conditional random field (CRF) layer for structured prediction. To better handle the noisy format of clinical notes, they built a rule-based sentence and word tokenizer, which led to better performance than the off-the-shelf Natural Language Toolkit [16]. Their system achieved the best micro F1 score of 0.829 for NER, and they found that character-level encoding and CRF sequence inference contributed to the performance improvement.
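To illustrate what the CRF layer adds on top of per-token scores, here is a minimal sketch of Viterbi decoding for a linear-chain model. It is not the authors' implementation: the function name, the label set, and all scores are invented for this example, and in a BiLSTM-CRF system the emission scores would come from the word-level BiLSTM and the transition scores would be learned.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][label] is the score of `label` at token t.
    transitions[(prev, cur)] is the score of moving from prev to cur;
    pairs that are not listed score 0.0.
    """
    if not emissions:
        return []

    # best[label] = score of the best path that ends in `label` so far.
    best = {label: emissions[0][label] for label in labels}
    backpointers: List[Dict[str, str]] = []

    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(
                labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0)
            )
            new_best[cur] = (
                best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            )
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Invented scores for a two-token span. Token-by-token argmax would give
# ["O", "I-Drug"], which is not a valid BIO sequence.
labels = ["O", "B-Drug", "I-Drug"]
emissions = [
    {"O": 1.0, "B-Drug": 0.5, "I-Drug": 0.0},
    {"O": 0.2, "B-Drug": 0.1, "I-Drug": 1.0},
]
transitions = {("O", "I-Drug"): -10.0}  # penalize an I- tag right after O
print(viterbi_decode(emissions, transitions, labels))
# prints ['B-Drug', 'I-Drug']
```

The example shows the design motivation in miniature: scoring whole sequences lets a strong transition penalty override a locally attractive but inconsistent label.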

Yang et al. [17] applied a similar BiLSTM-CRF structure for NER, combined with a support vector machine-based relation extraction system, to address all three tasks. They trained the BiLSTM-CRF in two stages: first optimizing parameters on validation data, and then training a final model on both the validation and training data, which led to better results than using the validation-optimized model directly. Their experiments demonstrate that separate classifiers for intra- and inter-sentence relations obtained better performance (F1 score of 0.8466) than a single classifier for both (F1 score of 0.8304).
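The intra- versus inter-sentence split can be pictured as a simple routing step in front of two models. The sketch below is only an illustration under stated assumptions: the `CandidatePair` fields, the `classify_pairs` function, and the stub classifiers are hypothetical, and the real system would place trained support vector machine models where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class CandidatePair:
    """A candidate relation between two entities (hypothetical layout)."""
    head_text: str
    tail_text: str
    head_sentence: int  # index of the sentence containing the head entity
    tail_sentence: int  # index of the sentence containing the tail entity


Classifier = Callable[[CandidatePair], str]


def classify_pairs(
    pairs: Sequence[CandidatePair],
    intra_classifier: Classifier,
    inter_classifier: Classifier,
) -> List[str]:
    """Send each pair to the classifier that matches its sentence span."""
    predictions = []
    for pair in pairs:
        same_sentence = pair.head_sentence == pair.tail_sentence
        classifier = intra_classifier if same_sentence else inter_classifier
        predictions.append(classifier(pair))
    return predictions


# Stub classifiers that only report which route was taken.
def intra_stub(pair: CandidatePair) -> str:
    return "intra"


def inter_stub(pair: CandidatePair) -> str:
    return "inter"


pairs = [
    CandidatePair("warfarin", "bleeding", head_sentence=2, tail_sentence=2),
    CandidatePair("warfarin", "bruising", head_sentence=2, tail_sentence=4),
]
print(classify_pairs(pairs, intra_stub, inter_stub))
# prints ['intra', 'inter']
```

Keeping the two models separate allows each to use features suited to its case, which is one plausible reading of why the split helped.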

Dandala et al. [18] employed two deep learning architectures for the challenge: a BiLSTM-CRF model for NER and a BiLSTM model with an attention mechanism for RI. In addition to character- and word-level embeddings, part-of-speech embeddings were also used for input encoding in both models. Based on the observation that “adverse drug events” and “indications” entities have a semantic overlap with “other signs and symptoms”, they experimented with a joint modeling method in which those three types of entities are first merged into one category for the NER model, and their relations with medications, as determined by the RI model, are in turn used to distinguish the three types. Experimental results show that the joint modeling approach outperformed the standard sequential model for the integrated NER-RI subtask (micro F1 of 0.653 vs. 0.624).
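The second half of this joint approach, using predicted relations to split the merged category back into three types, can be sketched as a small post-processing step. This is an illustration rather than the authors' code: the label strings, the relation names `adverse` and `reason`, and the rule that an adverse relation takes precedence over a reason relation are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical label and relation names chosen for this sketch.
MERGED = "MergedSignSymptom"
DEFAULT_TYPE = "OtherSignSymptom"

# (entity_id, drug_id, relation_type) triples predicted by the RI model.
Relation = Tuple[str, str, str]


def resolve_merged_entities(
    entity_types: Dict[str, str], relations: List[Relation]
) -> Dict[str, str]:
    """Replace the merged NER label using each entity's relations to drugs."""
    resolved: Dict[str, str] = {}
    for entity_id, entity_type in entity_types.items():
        if entity_type != MERGED:
            resolved[entity_id] = entity_type  # e.g. drugs stay unchanged
            continue
        linked = {rel for ent, _drug, rel in relations if ent == entity_id}
        if "adverse" in linked:  # assumed precedence: ADE wins over indication
            resolved[entity_id] = "ADE"
        elif "reason" in linked:
            resolved[entity_id] = "Indication"
        else:
            resolved[entity_id] = DEFAULT_TYPE  # no relation to any drug
    return resolved


entity_types = {"e1": MERGED, "e2": MERGED, "e3": MERGED, "d1": "Drug"}
relations = [("e1", "d1", "adverse"), ("e2", "d1", "reason")]
print(resolve_merged_entities(entity_types, relations))
```

In this toy input, `e1` becomes an ADE, `e2` an indication, and `e3`, which has no relation to a medication, falls back to the default type.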

Peterson et al. [19] explored traditional machine learning models, namely CRF for NER and random forest for RI, which were shown to be more computationally efficient and thus easier to deploy in real-world applications without depending on special high-performance infrastructure. As part of the feature engineering effort, they included word embeddings as clustering features trained with Mini-batch K-means, and also examined multiple cluster sizes and compound cluster features. Compared with the counterpart deep learning models, their system achieved competitive overall results through effective feature engineering, yielding the best micro F1 of 0.8684 for the RI subtask.

While the performance reported in this challenge is promising, there is much room for further improvement, especially for the complex joint NER-RI task. The design of better learning algorithms and the availability of more labeled data are two important factors in improving system performance. Another future direction is to validate and increase the generalizability of existing systems on larger-scale datasets from diverse clinical subspecialties. That may require more effort in building annotated data as well as exploring effective domain adaptation techniques for data-scarce subspecialties. Finally, it will be essential to investigate how to effectively integrate a large volume of diverse, dynamic, distributed structured and unstructured data from different sources, such as spontaneous reporting system reports, EHRs, insurance claims, medical literature, and social media, for collective ADE signal detection.

Data mining EHRs for drug safety surveillance, especially mining unstructured narratives through NLP, will remain an active research topic. The innovative approaches reported in this issue, which were motivated by the MADE1.0 challenge, will lay a solid foundation for further advancing methodological development and system deployment towards more intelligent drug safety surveillance.