Background

Vast amounts of biomedical information are only offered in unstructured form through scientific publications. It is impossible for researchers or curators of biomedical databases to keep pace with all information in the growing number of papers that are being published [1, 2]. Text-mining systems hold promise for facilitating the time-consuming and expensive manual information extraction process [3], or for automatically engendering new hypotheses and fresh insights [4, 5].

In recent years, many systems have been developed for the automatic extraction of biomedical events from text, such as protein-protein interactions and gene-disease relations [2, 6]. Relatively few studies have addressed the extraction of drug-related adverse effects, information that is relevant in drug research and development, healthcare, and pharmacovigilance [7]. That this subject has been studied less frequently may in part be explained by the scarcity of large annotated training corpora. Although cumbersome and expensive to construct, these data sets are essential to train the machine-learning based classifiers of most current event extraction systems. Relation extraction systems typically perform two tasks: first, they try to recognize the entities of interest; next, they determine whether there are relations between the recognized entities. In many previous studies, performance evaluation was limited to the second task, relation extraction, and did not consider the performance of the entity recognition task.

In this study, we describe the use of a knowledge base to extract drug-adverse effect relations from biomedical abstracts. The main advantage of our system is that it needs very little training data as compared to machine-learning approaches. Also, we evaluate the performance of the whole relation extraction pipeline, including the entity recognition part.

Related work

To extract biomedical relations from unstructured text a number of approaches have been explored, of which we mention simple co-occurrence, rule-based, and machine-learning based techniques.

The simplest approach is based on the co-occurrence of entities of interest. It assumes that if two entities are mentioned together in the same sentence or abstract, they are probably related. Typically, this approach achieves high recall, but low precision [8]. Since co-occurrence approaches are straightforward and do not involve linguistic analysis, their performance is often taken as a baseline to gauge other methods [9, 10].
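As an illustration of sentence-level co-occurrence, the following minimal sketch pairs every drug with every condition mentioned in the same sentence. The function name and the `(text, kind)` entity representation are assumptions made for this example only and do not correspond to any of the cited systems.

```python
from itertools import product


def cooccurrence_relations(sentence_entities):
    """Pair every drug with every condition found in one sentence.

    sentence_entities: list of (text, kind) tuples, where kind is
    "drug" or "condition" (a hypothetical representation).
    """
    drugs = [text for text, kind in sentence_entities if kind == "drug"]
    conditions = [text for text, kind in sentence_entities if kind == "condition"]
    # Every drug-condition combination is proposed as a relation,
    # which explains the high recall and low precision of this approach.
    return list(product(drugs, conditions))


entities = [
    ("methotrexate", "drug"),
    ("pneumocystis carinii pneumonia", "condition"),
    ("asthma", "condition"),
]
pairs = cooccurrence_relations(entities)
```

In this toy sentence the drug is paired with both conditions, including the treated disease, which shows how co-occurrence alone produces false-positive relations.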

Rule-based techniques are also a popular method for relation extraction. The rules are defined manually using features from the context in which the relations of interest occur. Such features may be prefixes and suffixes of words, part-of-speech (POS) tags, chunking information, etc. [11-13]. However, the large number of name variations and ambiguous terms in the text may cause an accumulation of rules [5]. This approach can increase precision, but often at the cost of significantly lower recall [14].

Machine-learning approaches automatically build classifiers for relation extraction, using contextual features derived from natural language processing techniques such as shallow parsing, which divides the sentence into chunks [15, 16], or full dependency parsing, which provides a complete syntactic analysis of sentence structures [17]. The performance of these methods is usually good [18-20], but they require annotated training sets of sufficient size. Also, processing time may be high [3].

Hybrid approaches that combine manual and automatic approaches have also become more popular in recent years [21, 22].

An example of a relation extraction system is JReX, developed by the JULIE lab [23]. JReX uses a support vector machine (SVM) algorithm as its classifier. Originally developed for the extraction of protein-protein interactions, it was later adapted to the domain of pharmacogenomics. Using the PharmGKB database [24], JReX obtained F-scores in the 80% range for gene-disease, gene-drug, and drug-disease relations [25]. The Semantic Knowledge Representation (SKR) system [26], developed by the National Library of Medicine, provides semantic representations of biomedical text by building on resources currently available at the library. SKR applies two programs, MetaMap [27] and SemRep [28], both of which utilize information available in the Unified Medical Language System (UMLS) [29]. SKR has been used for concept-based query expansion, for identification of anatomical terminology and relations in clinical records, and for mining biomedical texts for drug-disease relations and molecular biology information [30]. Java Simple Relation Extraction (jSRE) is still another relation extraction tool based on SVM. It has been used for the identification and extraction of drug-related adverse effects from Medline case reports [31, 32], achieving an F-score of 87% on the ADE corpus [33]. It should be noted that this high performance value was obtained on a selected set of sentences that contained relatively many drug-adverse event relations. A framework that integrates nine event extraction systems is U-Compare [34]. The U-Compare event meta-service provides an ensemble approach to relation extraction, where the combination of systems may produce a significantly better result than the best individual system included in the ensemble [34]. Hybrid approaches that combine different techniques have also been shown to perform well. Bui et al. [35] proposed a novel, very fast system that combines natural language processing (NLP) techniques with automatically and manually generated rules, and obtained an F-score of 53% on the Genia event corpus [36], a result that is comparable to other state-of-the-art event extraction systems.

Most of the existing relation extraction systems use machine-learning algorithms and require an annotated corpus for training. There are several publicly available biomedical text corpora with manually annotated relations, for instance the corpora generated as part of the Biocreative [37-39] and BioNLP [40, 41] challenges, the GENIA event corpus [36], PharmGKB [24], and the ADE corpus [33]. Most of these corpora focus on protein-protein interactions or other bio-events, while only two address drug-disease relations (PharmGKB) or drug-adverse effect relations (ADE corpus). As some of the annotations in PharmGKB have been reported to be hypothetical [42], we chose to use the ADE corpus as the gold standard corpus (GSC) for our experiments.

Methods

Corpus

The ADE corpus is originally based on 2972 Medline abstracts of case reports that were manually annotated for adverse drug effects [33]. The case reports were selected by a PubMed query with the MeSH (Medical Subject Headings) terms “drug therapy” and “adverse effect”. Only the sentences that contain at least one adverse drug effect have been made available by the corpus developers. The ADE corpus consists of 4272 of these sentences, taken from 1644 abstracts. The sentences contain annotations of 5063 drugs, 5776 conditions (diseases, signs, symptoms), and 6821 relations between drugs and conditions representing clear adverse effect occurrences [33]. Each relation consists of a Medline identifier, the sentence that contains this relation, the text and position of the drug, and the text and position of the adverse effect. Relations were only annotated if they occur in a single sentence. Drugs and conditions were not annotated if they were not part of an adverse event relation. We divided the 1644 abstracts that have sentences in the ADE corpus, into two sets: a small training set of 50 randomly selected abstracts, and a test set with the remaining abstracts (Table 1). Contrary to previous studies [32], we used all sentences in the 1644 abstracts, both the 4272 “positive” sentences that contain at least one relation according to the gold standard, and 7560 “negative” sentences that do not contain a relation.

Table 1 Number of abstracts, relations, and sentences in the ADE corpus

Relation extraction system

The relation extraction system consists of two main modules: a concept identification module that identifies drugs and adverse effects, and a knowledge-base module that determines whether an adverse effect relation can be established between the entities that are found. All modules were integrated in the Unstructured Information Management Architecture framework [43].

We used the Peregrine system (https://trac.nbic.nl/data-mining/) as the basis of our concept identification system. Peregrine is a dictionary-based concept recognition and normalization tool, developed at the Erasmus University Medical Center [44]. It finds concepts by dictionary look-up, performs word-sense disambiguation if necessary, and assigns concept unique identifiers (CUIs). We used Peregrine with a dictionary based on version 2012AA of the UMLS Metathesaurus, only keeping concepts that belong to the semantic groups “Chemicals & Drugs” and “Disorders” [45]. Rewrite and suppress rules are applied to the terms in the dictionary to enhance precision and recall [46].

To further improve concept identification, we employed a rule-based NLP module that we previously developed and tested for disease identification [47]. Briefly, the NLP module consists of a number of rules that are divided into five submodules, which carry out coordination resolution, abbreviation expansion, term variation, boundary correction, and concept filtering. The rules combine the annotations of a concept normalization system, such as Peregrine, with POS and chunking information. The coordination module uses POS and chunking information to reformat the coordination phrase and feed the reformatted text into the concept normalization system for proper annotation of the concepts. The abbreviation module combines an abbreviation expansion algorithm [48] with POS and chunking information to improve the recognition of abbreviations. The term variation module contains a number of rules that adjust noun phrases and feed the adjusted phrase into the concept normalization system again, to check whether it refers to a concept. The boundary correction module contains several rules that correct the start- and end positions of concepts identified by the system, based on POS and chunking information. The concept filtering module consists of two rules that suppress concepts that were identified by the concept normalization system. One rule removes a concept if the concept annotation in the text has no overlap with a noun phrase because in our experience, most UMLS concepts in biomedical abstracts belong to a noun phrase, or at least overlap with it. The other rule removes a concept if it is part of a concept filter list. The NLP module was not modified for the current task except for the concept filter list, which was adjusted based on our training data.
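The two concept filtering rules can be sketched as follows. This is a minimal illustration, not the implementation of the NLP module: the dictionary layout of a concept, the use of half-open character offsets, the choice to store CUIs (rather than terms) in the filter list, and the placeholder CUIs are all assumptions made for the example.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open character spans [start, end) share a character."""
    return a_start < b_end and b_start < a_end


def filter_concepts(concepts, noun_phrases, filter_list):
    """Apply the two suppression rules to a list of annotated concepts.

    concepts:     list of dicts with keys "cui", "start", "end" (assumed layout)
    noun_phrases: list of (start, end) spans from the chunker
    filter_list:  set of CUIs that should always be suppressed (assumed to hold CUIs)
    """
    kept = []
    for concept in concepts:
        # Rule 1: drop concepts that appear on the concept filter list.
        if concept["cui"] in filter_list:
            continue
        # Rule 2: drop concepts that do not overlap with any noun phrase.
        if not any(
            spans_overlap(concept["start"], concept["end"], np_start, np_end)
            for np_start, np_end in noun_phrases
        ):
            continue
        kept.append(concept)
    return kept


# Placeholder identifiers, not real UMLS CUIs.
concepts = [
    {"cui": "C0000001", "start": 0, "end": 5},
    {"cui": "C0000002", "start": 10, "end": 15},
    {"cui": "C0000003", "start": 20, "end": 25},
]
kept = filter_concepts(concepts, noun_phrases=[(0, 8), (18, 30)], filter_list={"C0000003"})
```

Here the second concept is removed because it lies outside every noun phrase, and the third because it is on the filter list.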

The knowledge base is a graph representation of the information contained in the UMLS Metathesaurus and the UMLS Semantic Network. The UMLS Metathesaurus defines terms and concepts (CUIs), as well as relations between the concepts. Each relation has a relation type, e.g., “is-a” or “cause-of”. There are a total of 621 relation types in the UMLS Metathesaurus. The UMLS Semantic Network consists of a set of semantic types, i.e., broad subject categories that provide a categorization of all concepts represented in the UMLS Metathesaurus. The semantic types are connected by semantic relations.

The knowledge base is a three-tier hierarchical graph in which vertices represent terms, concepts, and semantic types, and the edges represent relations between concepts and between semantic types. At the lowest level are the terms, which are linked to concepts at the second level. Each concept is linked to one or more semantic types, which are situated at the highest level. The knowledge base has been implemented in a graph database (http://www.neo4j.org) and was populated with concepts (CUIs) and relations extracted from the UMLS 2012 AA release. In this study, we only used the relations at the second level, i.e., between concepts.

The edges that connect two concepts form a path, with a length equal to the number of edges. The distance between two concepts is defined as the length of the shortest path. Note that there may be multiple shortest paths, but there is only one shortest path length.

For each sentence in the corpus, we determined the distance in the knowledge base between the drugs and adverse effects that were found by the concept identification module. Only if the distance between a drug-adverse effect pair was less than or equal to a distance threshold, a relation was considered present. Based on our training set, we empirically found that a distance threshold of four gave best performance results.
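The distance test can be illustrated with a breadth-first search that stops expanding once the threshold is reached. The sketch below is a simplified stand-in for a query against the graph database: the adjacency-dictionary representation, the treatment of relations as undirected edges, and the placeholder concept identifiers are assumptions made for the example.

```python
from collections import deque


def shortest_path_length(graph, source, target, max_depth):
    """Return the shortest path length from source to target,
    or None if no path of at most max_depth edges exists."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_depth:
            # Do not expand beyond the distance threshold.
            continue
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def has_relation(graph, drug, effect, threshold=4):
    """A relation is considered present if the distance is <= threshold."""
    return shortest_path_length(graph, drug, effect, threshold) is not None


# Toy undirected graph with placeholder concept identifiers.
toy_graph = {
    "C_DRUG": {"C_A"},
    "C_A": {"C_DRUG", "C_B"},
    "C_B": {"C_A", "C_EFFECT"},
    "C_EFFECT": {"C_B"},
    "C_FAR": set(),
}
```

In the toy graph the drug and the effect are three edges apart, so a relation is reported at a threshold of four but not at a threshold of two.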

Further reduction of false-positive drug-adverse effect relations was attempted by taking into account the type of the relations in the shortest paths between drugs and adverse events. In our training set, we counted the number of each relation type in the paths that resulted in false-positive and in true-positive drug-adverse effect relations. If for a relation type the ratio of the false-positive count plus one and the true-positive count plus one was greater than seven, we discarded any path containing that relation type. The value of seven was determined experimentally on the training set as yielding the best performance.
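The relation-type filter can be sketched as a smoothed ratio test over counts collected on the training set. In this minimal example each training path is given as a list of relation-type labels plus a flag indicating whether the resulting drug-adverse effect relation was a true positive. Counting every occurrence of a relation type (rather than once per path) is an assumption, and the relation-type labels are illustrative only.

```python
from collections import Counter


def blocked_relation_types(training_paths, ratio_threshold=7.0):
    """Return relation types whose (FP + 1) / (TP + 1) ratio exceeds the threshold.

    training_paths: iterable of (relation_types, is_true_positive) pairs,
    where relation_types lists the edge labels along one shortest path.
    """
    fp_counts = Counter()
    tp_counts = Counter()
    for relation_types, is_true_positive in training_paths:
        counts = tp_counts if is_true_positive else fp_counts
        counts.update(relation_types)
    all_types = set(fp_counts) | set(tp_counts)
    return {
        rt
        for rt in all_types
        if (fp_counts[rt] + 1) / (tp_counts[rt] + 1) > ratio_threshold
    }


def path_allowed(relation_types, blocked):
    """A path is discarded if it contains any blocked relation type."""
    return not any(rt in blocked for rt in relation_types)


# Illustrative training data: 8 false-positive and 3 true-positive paths.
training = [(["isa", "may_treat"], False)] * 8 + [(["isa", "causes"], True)] * 3
blocked = blocked_relation_types(training)
```

With these counts, "may_treat" has a ratio of (8 + 1) / (0 + 1) = 9 and is blocked, whereas "isa" has a ratio of (8 + 1) / (3 + 1) = 2.25 and is retained. The added one in the numerator and denominator avoids division by zero for relation types that never occur in true-positive paths.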

Performance evaluation

In the ADE corpus, drug-adverse effect relations are annotated at the sentence level by specifying the start and end positions of the drug and the adverse effect. Our evaluation included both the 4272 positive and the 7560 negative sentences. We counted a relation found by our system as true positive if the boundaries of the drug and the adverse effect exactly matched those of the gold standard. A gold-standard relation that was not found by the system, including the case in which the concept boundaries were not rendered exactly, was counted as false negative. A relation that was found only by the system, including the case in which the concept boundaries did not exactly match the gold standard, was counted as false positive. Performance was evaluated in terms of precision, recall, and F-score. An error analysis was carried out on a sample of 100 randomly selected errors that were made by our relation extraction system.
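The exact-match evaluation can be sketched as a set comparison. In this minimal example a relation is represented as a tuple of a document identifier and the start and end offsets of the drug and the adverse effect. This tuple layout is an assumption, as is the use of the balanced F-score (the harmonic mean of precision and recall).

```python
def evaluate(gold, predicted):
    """Compute precision, recall, and F-score with exact boundary matching.

    gold, predicted: iterables of
    (doc_id, drug_start, drug_end, effect_start, effect_end) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # boundaries match exactly
    fp = len(predicted - gold)   # found only by the system
    fn = len(gold - predicted)   # gold relations that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f_score


# Toy data with a placeholder document identifier.
gold = [("doc1", 0, 12, 30, 45), ("doc1", 50, 58, 70, 82)]
predicted = [("doc1", 0, 12, 30, 45), ("doc1", 50, 58, 70, 81)]
precision, recall, f_score = evaluate(gold, predicted)
```

The second predicted relation differs from the gold standard by a single character offset, so it counts both as a false positive and as a false negative, which illustrates how strict the exact-match criterion is.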

Results

Performance of the relation extraction system

Table 2 shows the performance of the Peregrine baseline system on the test set of the ADE corpus, and the incremental contribution for each of the different modules. The baseline system had a high recall but low precision, yielding an F-score of 16.1%. Use of the NLP module more than doubled the F-score. Application of the knowledge base further improved the F-score by 12.6 percentage points. Relation-type filtering increased the F-score by another 4.3 percentage points. Overall, the knowledge-base module decreased recall by 8.1 percentage points, but increased precision by 17.0 percentage points.

Table 2 Performance (in %) of the baseline relation extraction system and the incremental contribution of different system modules, on the test set of the ADE corpus

Effect of different distance thresholds in the knowledge base

Table 3 shows the performance of the relation extraction system on the ADE test corpus for different distance thresholds (the maximum allowed length of the shortest path between a drug and an adverse effect) in the knowledge base. The highest F-score of 50.5% is obtained with a distance of four. Lowering the distance threshold increases precision and decreases recall. The highest recall is 76.5% (precision 37.0%) at a threshold of five, the highest precision is 43.2% (recall 1.6%) at a threshold of one.

Table 3 Performance (in %) of the relation extraction system on the test set of the ADE corpus for different distance thresholds in the knowledge base

Effect of different training set sizes

To assess the effect of increasing amounts of training data on system performance, training sets of 100, 200, and 400 abstracts were selected from the ADE corpus. The abstracts in a training set were a subset of the abstracts in the next larger training set. For each training set, the corresponding test set consisted of the remaining abstracts in the ADE corpus. Table 4 shows that the performance of the relation extraction system improves with larger amounts of training data, but levels off with increasing size. The system obtains an F-score of 54.3% when trained on 400 abstracts, which is an improvement of 3.8 percentage points as compared with the system trained on 50 abstracts. The NLP module contributed 1.7 percentage points to this improvement, and the relation-type filter module 2.1 percentage points. The baseline Peregrine module and the knowledge-base module do not require training and thus were not changed.

Table 4 Performance (in %) of the relation extraction system on the test set of the ADE corpus for different sizes of the training set

Performance comparison of knowledge based and machine-learning based relation extraction

Part of the ADE corpus that we used in our experiments has previously been used by Gurulingappa et al. [32] to develop and evaluate a machine-learning based relation extraction system based on jSRE. To enable a comparison of the performance of our knowledge-based relation extraction system and the previously published results for jSRE, we set up the same training and test environment as described by Gurulingappa et al. [32]. Similar to Gurulingappa et al., we removed 120 relations with nested annotations in the gold standard (e.g., “acute lithium intoxicity”, where “lithium” is related to “acute intoxicity”), and only used the positive sentences in the ADE corpus. In [32], all remaining true relations (taken from the gold standard) were supplemented by false relations (taken from co-occurring drugs and conditions that were found by ProMiner [49], a dictionary-based entity recognition system), in a ratio of 1.26:1. To create a corpus with the same ratio to train and test our system and allow comparison of results, we took all true relations in which the concepts were found by Peregrine and the NLP module, and randomly added false co-occurrence relations generated by Peregrine and the NLP module until the ratio of 1.26:1 was reached.

Table 5 shows the performance of our knowledge-base system and the previously reported performance of jSRE [32]. Without any training corpus, i.e., only applying the knowledge base but not the relation-type filtering, which requires training, our system already obtained an F-score of 88.5%. Additional use of the relation-type filter trained on small sets of 10 or 50 abstracts resulted in slightly higher F-scores, which were substantially better than those obtained with jSRE. The best F-score reported for jSRE, when about 90% of the abstracts in the corpus were utilized for training, was 87% [32].

Table 5 Performance (in %) of a machine-learning based (jSRE) relation extraction system [32] and the knowledge-based system on a subset of the ADE test corpus (see text)

Error analysis

We randomly selected 100 errors that the system made in our test set, and manually classified them into different error types (Table 6). False-positive errors were mostly due to drugs and adverse effects that were correctly found by the concept identification module, but were wrongly annotated by the knowledge-base module as having a relation. Of the 64 errors of this type, 46 occurred in negative sentences, i.e., sentences that do not contain any drug-adverse effect relation according to the gold standard. For instance, the gold standard did not annotate a relation in “Norethisterone and gestational diabetes”, but the system found “norethisterone” as a drug concept, “gestational diabetes” as an adverse effect, and generated a false-positive relation between these two concepts. Eighteen of the 64 errors occurred in positive sentences. For instance, in the sentence “Pneumocystis carinii pneumonia as a complication of methotrexate treatment of asthma”, the gold standard annotated a relation between the drug “methotrexate” and the adverse effect “pneumocystis carinii pneumonia”, concepts that were also found by the system. However, the system also annotated “asthma” as another adverse effect concept, which generated a false-positive relation between “methotrexate” and “asthma”. The second type of false-positive errors was caused by incorrectly found concepts, for which a relation was found in the knowledge base. For instance, in “Drug-induced pemphigus related to angiotensin-converting enzyme inhibitors”, the system incorrectly annotated “angiotensin-converting enzyme inhibitors” as a drug, and wrongly established a relation with “drug-induced pemphigus”. Altogether, false-positive errors accounted for 79% of all errors.

Table 6 Error analysis of 100 randomly selected errors on the ADE test set

False-negative errors were generated because the system missed a concept, or did not find a relation in its knowledge base between two correctly found concepts. An example of the first type of error is the term “TMA” (thrombotic microangiopathy), which the system incorrectly recognized as a drug in the sentence “A case report of a patient with probable cisplatin and bleomycin-induced TMA is presented.” The system then missed the relations between the adverse effect “TMA” and the drugs “cisplatin” and “bleomycin”. The other type of false-negative error is illustrated by the sentence “Encephalopathy and seizures induced by intravesical alum irrigations”, which contains two relations, one between “alum” and “encephalopathy”, the other between “alum” and “seizures”. The concept-recognition module found all three concepts correctly, but the knowledge-base module could not find the relation between “alum” and “seizures”. False-negative errors contributed 21% to the total number of errors.

Discussion

We have investigated the use of NLP and a knowledge base to improve the performance of a system to extract adverse drug events. By applying a set of post-processing rules that utilize POS and chunking information, and exploiting the information contained in the UMLS Metathesaurus and the UMLS Semantic Network, the F-score on the ADE corpus improved by 34.4 percentage points as compared to a simple co-occurrence baseline system. To our knowledge, this is the first study that uses a knowledge base to improve biomedical relation extraction.

The main advantage of our approach as compared to machine-learning approaches is the relatively small set of annotated data required for training. For the ADE corpus, we only used 50 abstracts (3% of the total corpus) to train our system. When we compared our system with a machine-learning system trained on a document set of the same size, our system performed substantially better. Although a machine-learning approach usually performs very well if trained on a sufficiently large training set, the creation of a gold standard corpus (GSC) is tedious and expensive: annotation guidelines have to be established, domain experts must be trained, the annotation process is time-consuming, and annotation disagreements have to be resolved [50]. As a consequence, GSCs in the biomedical domain are generally small and focus on specific subdomains. It should also be noted that even when most of the ADE corpus was used to train the machine-learning system, it did not perform better than our knowledge-based system.

It is difficult to compare the performance of our system with those of the many other relation extraction systems reported in the literature because of the wide variety of relation extraction tasks and evaluation sets. We also evaluated the performance of the whole relation extraction pipeline (similar to, e.g., [51, 52]), whereas other studies focused on the relation extraction performance under the assumption that the entities involved were correctly recognized [12, 32, 53-55]. Moreover, previous systems were sometimes evaluated on a selected set of abstract sentences. As mentioned earlier, Gurulingappa et al. [32] mainly used positive sentences with at least one relation from the abstracts in the ADE corpus, and did not consider relations with nested entities. Similarly, Buyko et al. only used sentences with at least one gene-disease, gene-drug, or drug-disease relation in the PharmGKB database. Both systems obtained F-scores larger than 80%. In a comparable test setting, our system obtained at least as good results (F-score 89%), but in a more realistic test environment, which included the whole relation extraction pipeline and all sentences of the abstracts, performance dropped considerably (F-score 51%). This can largely be attributed to the additional false-positive relations in the negative sentences of the abstracts, decreasing precision considerably. Although our evaluation setting is more realistic, results may still be optimistically biased because our corpus only consisted of abstracts that contain at least one sentence that describes an adverse drug event. The inclusion of abstracts that do not describe adverse drug events would further reduce the system’s precision.

Our error analysis indicated that for the majority of errors the entities are correctly identified (72/100), the error being made in the knowledge-base module. A potential source of false-negative errors is that drugs and adverse events in the knowledge base have no relations with other concepts. However, only 2.8% of the 4700 unique concepts that were found in the ADE corpus did not have any relation. The median number of relations per concept was 22. To reduce the number of false-negative errors, we plan to extend the knowledge base by including relations mined from other drug-adverse effect databases, such as DailyMed [56], DBpedia [57], and DrugBank [58]. False-positive errors generated by the knowledge base may be decreased by including more strict filtering rules on the relation types. We also noted several general concepts, e.g., “patient”, “drug”, and “disease”, that are highly connected. Their removal may improve performance. Finally, we currently took all relation types as equally important and did not consider the plausibility of a path that connects two concepts. Development of a weighting scheme of different relation types and rules that check the plausibility of the possible paths may be able to better distinguish false from true drug-adverse effect relations.

Our system has several limitations. The system currently does not try to distinguish between drug-adverse event relations and drug-disease treatment relations. Further investigation of the relation types in the paths that connect drugs and conditions in the knowledge base may help to differentiate these two situations, but is left for future research. A second limitation is that the knowledge-base module, in order to establish a potential relation, requires concept identifiers as its input. Concept identification is generally considered more difficult than the recognition of named entities, which can serve as the input for machine-learning based relation extraction. Another, related limitation of the current system is that the UMLS Metathesaurus does not provide extensive coverage of genes and proteins. The incorporation of relations from other sources of knowledge, such as UniProt or the databases that are made available through the LODD (Linking Open Drug Data) project, may remedy this drawback.

Conclusion

We have shown that a knowledge-based approach can be used to extract adverse drug events from biomedical text without need for a large training set. Whether use of a knowledge base is equally advantageous for other biomedical relation extraction tasks remains to be investigated.