Background

Metadata are a fundamental feature of biomedical ontologies, describing a wealth of natural language information in the form of labels and descriptions [1]. Ontologies formalized in the Web Ontology Language (OWL) [2] implement metadata in the form of annotation properties, and these can be used to describe multiple natural language labels for a single term, forming a collection of words and phrases that humans use to signify the concept. Open Biomedical Ontologies (OBO) [3] and the Information Artifact Ontology (IAO) [4] define a series of conventional annotation properties that can be used for the expression of labels and synonyms. These features are widely used: an investigation of ontologies in BioPortal found that 90% of classes had a label associated with them [5]. For example, as of 2017 the Human Phenotype Ontology (HP) [6] contained 14,328 synonyms for 11,813 classes [7]. The labels associated with ontology terms constitute a controlled domain vocabulary [1].

The domain vocabulary makes ontologies a valuable resource for text mining, particularly in information retrieval and extraction tasks [8]. The natural language labels associated with ontology classes can be used to identify where a class is mentioned in text. Furthermore, associating entities described in text with ontologies enables their integration with other datasets annotated by the same ontologies, and supports the application of ontology-based analysis techniques such as semantic similarity [9, 10], semantic data mining [11], machine learning [12], or clustering [13].

However, due to limitations on the resources available for expert curation of ontologies and the sheer scale of their contents, the labels obtainable from single ontologies are not exhaustive. Combined with the tendency of biomedical text to present semantically equivalent concepts in alternative forms [14], this means that ontology labels are not always a good fit for text corpora that mention the entities described by ontology concepts [15]. By expanding the set of synonyms in an ontology, particularly with synonyms that better fit text corpora, the performance of the natural language processing tasks that depend on them may be improved.

This potential is reflected by previous work in the field. One approach, which analysed existing synonyms across the ontology hierarchy to determine new synonyms, reported an increase in performance for the task of retrieving articles from a literature repository [16]. Another rule-based synonym expansion approach to extending the Gene Ontology showed improved performance in concept recognition tasks [17]. A combined machine-learning and rule-based approach to learning new HP synonyms from manually annotated PubMed abstracts improved performance on an annotation task over a gold standard text corpus [18]. These methods generate additional candidate synonyms by combining components of a class's labels with those of upper-level classes, and search text corpora to narrow the candidates down to true synonyms.

Ontology-based annotation software such as OBO Annotator [19], ConceptMapper [20], and the NCBO Annotator [21] include routines that apply rule-based morphological and positional transformations to terms to increase concept recognition recall. Parameters that control the use of these features have a strong influence on annotation performance [22]. Previous work has also investigated synonym acquisition and derivation for the purposes of improving the performance of lexical ontology matching and alignment tasks [23]. Outside of automated synonym generation, organised efforts have been made to manually extend an ontology’s synonyms for a particular purpose. For example, HP was expanded with layperson synonyms to enable its use in applications that interact directly with patients [24].

However, no work to our knowledge has considered linking different ontology classes for the purposes of vocabulary expansion. Many biomedical entities are described by several classes in equivalent or similar contexts across several ontologies. For example, terms describing hypertension exist in many ontologies and medical terminologies. The hypertension (HP:0000822) term describes the condition in the context of a phenotype, while hypertension (DOID:10763) from the Disease Ontology (DO) [25] describes it in the context of a disease (although the difference between a disease and a phenotype is disputed). Disease-specific or application ontologies also extend the definitions provided by general domain ontologies. For example, the Hypertension Ontology (HTN) [26] extends the HP and DO hypertension classes, adding additional information including labels. Furthermore, the subtle distinctions between concepts that biomedical ontologies capture, such as phenotype versus disease, do not necessarily affect many commonly applied text mining tasks, because these contexts share the same labels.

We hypothesise that, because ontologies are constructed with different focuses, arising from differing contexts, domain experts, and source material, ontologies that define concepts describing the same real-world entities will contain different, but valid, synonyms for a particular context. By considering all of these terms, we can construct extended vocabularies that may improve the power of ontology-based text mining tasks.

In this paper, we describe and implement a synonym expansion approach that combines lexical matching and semantic equivalency to obtain new synonyms for biomedical concepts. The synonym expansion algorithm derives additional synonyms for a class by matching it with classes from other ontologies, making use of the AberOWL ontology reasoning framework [27]. We use the approach to extend several ontology vocabularies, and evaluate them both manually and in an ontology-based patient characterisation task.

Results

The synonym expansion algorithm is available as part of the Komenti text mining framework, which is available under an open source licence at https://github.com/reality/komenti, while the files used for validation are available at https://github.com/reality/synonym_expansion_validation.

Algorithm

The synonym expansion algorithm, including the two matching methods and the steps to prune candidate synonyms, is described below; a minimal code sketch of the matching steps follows the list. The process is performed for every given input class, where an input class is any ontology class for which we want to obtain additional synonyms. In the steps below, ‘every ontology’ means any of the ontologies indexed in AberOWL.

  1. Extract the labels and synonyms of any class, in any ontology, with a label or synonym that exactly matches the first label of the input class.

  2. Run an equivalency query against every ontology using the Internationalised Resource Identifier (IRI) of the input class, extracting the labels and synonyms of any classes returned.

  3. Of the candidate synonyms produced by the first two steps, discard any that:

     • were defined in ontologies found to produce incorrect synonyms;

     • have the form of a term identifier;

     • contain the input class label as a substring.

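To make the flow concrete, the following is a minimal Groovy sketch of the two matching steps. The in-memory repository and its record fields (iri, ontology, labels, equivalentTo) are illustrative assumptions standing in for AberOWL's index; in the real implementation, these lookups are AberOWL API calls and a description logic query.

```groovy
// Toy stand-in for AberOWL's index; the record fields are assumptions.
def repository = [
  [iri: 'HP:0000822', ontology: 'HP', labels: ['Hypertension', 'High blood pressure']],
  [iri: 'DOID:10763', ontology: 'DO', labels: ['hypertension', 'HTN']],
  [iri: 'MP:0005333', ontology: 'PhenomeNET', labels: ['decreased heart rate'],
   equivalentTo: ['HP:0000822']],
]

// Step 1: strict lexical matching against the input class's first label
// (case-insensitive comparison is an assumption).
def lexicalCandidates = { String firstLabel ->
  repository.findAll { cls -> cls.labels.any { it.equalsIgnoreCase(firstLabel) } }
            .collectMany { it.labels }
}

// Step 2: equivalency matching by IRI. Only shared IRIs and asserted links
// are modelled here; AberOWL's reasoner also returns inferred equivalents.
def equivalentCandidates = { String iri ->
  repository.findAll { it.iri == iri || iri in (it.equivalentTo ?: []) }
            .collectMany { it.labels }
}

// Steps 1 and 2 combined and deduplicated; pruning (step 3) is sketched
// in the pruning discussion below.
def expand = { String iri, String firstLabel ->
  (lexicalCandidates(firstLabel) + equivalentCandidates(iri)).unique()
}

println expand('HP:0000822', 'Hypertension')
```
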
The algorithm uses two different methods to identify matching classes, specified in steps one and two above. Strict lexical matching identifies otherwise unlinked terms that have a label identical to the first label of the input class. Only the first label of the input class is used, because we found that additional labels and synonyms were more likely to match classes with different meanings, leading to more incorrect candidate synonyms. Mapping terms across ontologies via shared labels or metadata is a well-established technique in ontology alignment [28].

Equivalency queries are used to obtain additional candidate synonyms from classes that are equivalent to the input class but do not share its first label. In OWL ontologies, classes are uniquely identified by their IRIs, and classes that share the same IRI are automatically considered equivalent by a reasoner. This matches classes in the case that another use of the same class is expressed under a different first label in another ontology, which can occur when ontologies fall out of sync or when a referencing class intentionally omits annotation properties. In addition, equivalencies between different classes can be directly asserted via axioms in an ontology, or can appear as the result of a logical inference. Since the classes are semantically equivalent, the metadata of the other class, including its labels, can be used to refer to the original. To retrieve equivalent classes, the AberOWL API runs an equivalency query against each ontology in the repository, using the description logic reasoner to obtain a list of matching classes, which then contribute additional synonyms.

After the main matching stage, the set of labels is pruned to remove incorrect values. Some ontologies include term identifiers as labels, which cannot be exploited by text-mining applications; candidate synonyms containing a colon or underscore are therefore removed. The algorithm also removes labels sourced from GO-PLUS [29], MONDO [30], CCONT [31], and phenX [32], because we found that these ontologies consistently produced incorrect synonyms. Incorrect synonyms could appear in these ontologies due to human error or, in the case of large meta-ontologies such as MONDO, as a result of algorithmic error in asserting equivalencies between phenotypes across species. We also remove labels that include the input label as a substring, as these add no value to concept recognition systems: the smaller string would match anyway, making the longer string redundant.
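
A minimal sketch of these pruning predicates follows, assuming the source ontology of each candidate is tracked alongside it; the blocklist mirrors the ontologies named above, and the case-insensitive substring test is an assumption.

```groovy
// Pruning predicates for candidate synonyms (step 3).
def BLOCKLIST = ['GO-PLUS', 'MONDO', 'CCONT', 'PHENX'] as Set

boolean keepCandidate(String candidate, String inputLabel, String sourceOntology) {
  if (sourceOntology.toUpperCase() in BLOCKLIST) return false          // known-bad source
  if (candidate.contains(':') || candidate.contains('_')) return false // term identifier form
  if (candidate.toLowerCase().contains(inputLabel.toLowerCase())) return false // redundant superstring
  true
}

assert !keepCandidate('HP:0000822', 'hypertension', 'HP')             // identifier form
assert !keepCandidate('essential hypertension', 'hypertension', 'DO') // contains input label
assert  keepCandidate('high blood pressure', 'hypertension', 'DO')
```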

Ontology expansion

We applied the vocabulary expansion algorithm to all 9,908 subclasses of disease (DOID:4) in the Disease Ontology (DO). DO itself asserts 24,878 labels and synonyms for these classes. The expanded DO vocabulary contained 76,240 labels and synonyms. We also applied the algorithm to the 14,406 non-obsolete subclasses of Phenotypic abnormality (HP:0000118) in HP. HP itself asserts 29,805 labels and synonyms. The number of labels and synonyms following expansion was 54,765. Therefore, the algorithm found 24,960 additional synonyms for terms in HP.

For the DO term hypertension (DOID:10763), the first two steps of the algorithm, which obtain candidate synonyms, found 70 synonyms, not counting the word ‘hypertension’ itself. Of these, 56 were obtained via lexical matching and 14 by equivalency query. The sources of these synonyms are summarised in Table 1. After deduplication, 28 labels and synonyms remained, 3 of which were asserted by DO itself. The algorithm therefore found 25 new synonyms that were not asserted for the original DO term.

Table 1 Sources, per ontology, of the 70 non-unique synonyms found for the term hypertension (DOID:10763)

In this example, there were no synonyms uniquely found via equivalency. However, if we use bradycardia as the input class, we can identify two new synonyms from PhenomeNET [35], bradyrhythmia and reduced heart rate, which were not otherwise obtained via lexical matching. This is because PhenomeNET establishes a semantic equivalency between decreased heart rate (MP:0005333) and bradycardia, and the MP class does not share its first label with the HP class.

Manual validation

To evaluate the correctness of synonyms in the expansion of HP, a clinical expert manually evaluated 866 novel synonyms found for 500 randomly selected terms. Table 2 summarises the results, which show a precision of 0.912. 195 entries were marked as ambiguous, either because the synonym was in a non-English language or because the clinician did not have enough expertise with the term to determine whether the synonym was correct. The vast majority of these (161) were non-English labels, while 32 were English-language synonyms the clinician could not judge.

Table 2 Metrics for clinical expert validation of 866 generated synonyms for 500 terms

Annotation

As an initial evaluation of whether the extended vocabularies yield more annotations of biomedical text, and could thereby improve performance at information retrieval and extraction tasks, we annotated the text associated with 1,000 randomly sampled MIMIC-III patients. We built a vocabulary using all non-obsolete subclasses of Abnormality of the cardiovascular system (HP:0011025), and compared the number of annotations before and after vocabulary expansion using our method. HP asserts 2,205 labels and synonyms for these classes, while the expanded set of labels numbers 5,336. The results are summarised in Table 3.

Table 3 Number of labels for Abnormality of the cardiovascular system (HP:0011025) before and after synonym expansion, and the number of annotations made over the text associated with 1,000 MIMIC-III patients with these vocabularies

Patient characterisation

While the annotation task showed that our method can lead to more annotations, this does not necessarily mean that those annotations were correct or informative. Indeed, the manual validation indicates that there is some level of error associated with the process. To identify whether the annotations were informative and useful, we evaluated how the increased number of annotations affected performance on a downstream task.

In particular, we evaluated whether the additional ontology annotations yielded by the vocabulary expansion process improved the ability of semantic similarity, calculated from those annotations, to predict shared primary diagnosis within the MIMIC-III dataset [36]. We annotated a sample of 1,000 patient visits using classes from the Disease Ontology (DO) that contained cross-references to ICD-9, both before and after label expansion using the presented algorithm. We then used those annotations to calculate a measure of semantic similarity between the patient visits, and evaluated the resulting rankings with respect to whether highly ranked patient visits shared the primary diagnosis ICD-9 code (with which each patient visit is annotated in MIMIC), including codes we did not find through DO cross-references (and which were therefore not annotated).

The semantic similarity approach allows us to match patients who share a primary diagnosis even if they are not annotated directly with that disease (in this case, if we did not have an ICD-9 mapping for it), under the assumption that patients who share the same diagnosis will be more similar on the basis of the auxiliary symptoms associated with the disease they share. If we can more effectively annotate patients with the conditions that we do know about (those present in our annotation vocabulary), then we should be able to rank them together in a way that is more predictive of a shared primary diagnosis.

We used the mean reciprocal rank and the mean average precision to measure how well the semantic similarity rankings predicted matching first diagnoses. The results of the ranking task are shown in Table 4, with the expanded vocabulary leading to increased performance in both cases. To determine whether the difference was significant, we used the Wilcoxon rank-sum test to compare the ranks of patient similarity pairs with matching first diagnoses, yielding a p-value of 0.0007063.

Table 4 Comparison of the annotations of texts for 1,000 randomly sampled MIMIC-III patient visits before and after expansion, and the associated performance of semantic similarity scores calculated from those annotations in predicting shared first diagnosis

Discussion

The results clearly demonstrate that, for two biomedical ontologies, our approach vastly increases the number of labels and synonyms available for their terms. Using hypertension as an example, we showed that a range of different ontologies contribute additional synonyms, leading to 25 new unique labels for the term. By leveraging these, we can effectively enrich term vocabularies.

While we only manually validated a small subset of terms from HP, this indicated a fairly high precision for candidate terms. Through analysis of the false positives, we found that many were caused by errors in the ontologies from which the synonyms were sourced. For example, several synonyms for motor aphasia (HP:0002427), including “Broca Dysphasia,” were marked as incorrect because they refer to dysphasia. Aphasia and dysphasia are different conditions: the former refers to a complete loss of language, and the latter to a partial loss. All of these incorrect synonyms were sourced from Aphasia, Broca (MESH:D001039) in MESH.

Though this is not reflected in the results, we also found during development of the algorithm that certain ontologies produced consistently incorrect synonyms. Several of these are meta-ontologies, automatically constructed from multiple ontologies using alignment and integration methods, and it is possible that errors in that process caused the incorrect synonyms. Certain annotation properties, such as europe pmc and kegg compound, were also incorrectly reported by the AberOWL API as labels. Candidate synonyms defined by problematic ontologies or matching this list of annotation properties are automatically removed. Expanding the list of ontologies excluded as label sources might further improve the precision of the algorithm, but potentially at the cost of losing correct synonyms.

Furthermore, the manual validation revealed that many of the returned synonyms were in languages other than English. While OWL ontologies allow language tags that identify the language of an annotation value, AberOWL does not index them, so it is not currently possible to distinguish between English and non-English synonyms. These items were marked as ambiguous and not counted in the overall precision. This could also be partially controlled by excluding additional ontologies from the results; for example, WHOFRE is a non-ontology mapping of French vocabulary to UMLS. For any use where reduced vocabulary accuracy is unacceptable, the algorithm should be used as a candidate label generator, with candidates checked by a domain expert before further use.

We also demonstrated that our expansion of the HP vocabulary increases the number of phenotype annotations produced for MIMIC-III patient visit text records. While we did not directly validate the correctness of these annotations, which would necessarily be a time-consuming task, our final evaluation explored whether the additional synonyms would improve performance in a downstream task. This evaluation showed a clear and significant increase in performance for a patient stratification task over MIMIC-III: identifying shared first diagnosis via semantic similarity scores derived from ontology annotations. This indicates that, for certain tasks, our approach can increase the quality of entity characterisations gained by information extraction, and in turn the power of ontology-based analyses, even without manual validation of the produced labels.

Limitations and future work

The most important potential limitation of the algorithm is that lexical matching treats a shared name as evidence of identity, whereas in OWL it is the IRI of a concept, not its name, that uniquely identifies it; OWL ontologies do not follow the unique name assumption. False positives could, in theory, be generated by a lexical match on a homonym, which itself has different synonyms. We believe, however, that this effect should be limited in the case of highly specific biomedical language. Furthermore, any such error would most likely be mitigated by the restricted context of the target dataset. For example, synonyms derived from different contexts and incorrectly associated with a medical concept are unlikely to be present within clinical letters.

False synonyms could also be removed on the basis of a corpus search. For example, if a candidate synonym never, or only rarely, appears in the same document as another label used for the term across a literature corpus, it may refer to a different concept from a disjoint context. This could also be performed by analysing the metadata of text corpora: if two terms are never, or only rarely, associated with literature from the same journals, fields, or content tags, they may have different meanings. In a further study, we intend to investigate whether synonymy can be identified using word embeddings.
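
As an illustration of the corpus-based filter, a minimal Groovy sketch follows; the toy corpus, the case-insensitive substring matching, and the 0.01 threshold are all placeholder assumptions.

```groovy
// Flag a candidate synonym that never (or only rarely) co-occurs in a
// document with any known label for the term across a corpus.
boolean likelyFalseSynonym(String candidate, List<String> knownLabels,
                           List<String> corpus, double threshold = 0.01) {
  def docs = corpus.findAll { it.toLowerCase().contains(candidate.toLowerCase()) }
  if (docs.isEmpty()) return true // never appears at all
  def coOccurring = docs.count { doc ->
    knownLabels.any { doc.toLowerCase().contains(it.toLowerCase()) }
  }
  coOccurring / docs.size() < threshold
}

def corpus = ['hypertension, also called high blood pressure, was noted',
              'the patient reports persistent knee pain']
assert !likelyFalseSynonym('high blood pressure', ['hypertension'], corpus)
```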

While equivalency queries return fewer synonyms, and few that are not also obtained by lexical matching, the synonyms they do return can be treated with a higher level of confidence. For this reason, using only this method could be offered as a parameter for cases in which higher accuracy is required.

Conclusions

We have demonstrated that an inter-ontology approach to vocabulary expansion is a powerful method for adding informative labels and synonyms to terms used in text mining. These synonyms are found with fairly high precision, and lead to a greater rate of concept recognition in clinical and literature settings. Most importantly, we have shown that the approach improves the power of an ontology-based characterisation and analysis of patients via clinical text.

Methods

All files described in the validation (excluding the MIMIC-III data files), along with the commands necessary to repeat the experiments are available at https://github.com/reality/synonym_expansion_validation/.

Algorithm

We implemented the algorithm as a module in the Komenti semantic text mining framework using the Groovy programming language [37]. It makes use of the AberOWL API [27] for label matching and semantic queries, documented at http://www.aber-owl.net/docs/.

OWL ontologies use a number of conventional annotation properties to define labels and synonyms, spanning a range of confidence and degrees of synonymy. In this paper, we consider the frequently used annotation properties summarised in Table 5, which are those consolidated into the ‘synonym’ property by the AberOWL API. Another oboInOwl synonym property, hasRelatedSynonym, is excluded because the labels it provides are too imprecise.

Table 5 Summary of conventionally used annotation properties considered in this experiment
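
For reference, these properties can be enumerated by IRI. The list below is an assumption based on conventional OBO and IAO usage and may not match Table 5 exactly; hasRelatedSynonym appears only in the exclusion set, as described above.

```groovy
// Annotation property IRIs conventionally used for labels and synonyms.
// Assumed from common OBO/IAO usage; may not match Table 5 exactly.
def SYNONYM_PROPERTIES = [
  'http://www.w3.org/2000/01/rdf-schema#label',
  'http://purl.obolibrary.org/obo/IAO_0000118',                     // IAO 'alternative term'
  'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
  'http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym',
  'http://www.geneontology.org/formats/oboInOwl#hasNarrowSynonym',
]
def EXCLUDED_PROPERTIES = [
  'http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym', // too imprecise
]
```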

Manual validation

To evaluate the performance of the algorithm, we randomly selected 500 classes from the expanded version of HP for manual validation. Synonyms already asserted by HP were removed from the set, because they were assumed to be correct and would not contribute to measuring the performance of the synonym expansion algorithm. A clinical expert (WB) marked each synonym as correct, incorrect, or ambiguous. The expert was asked to mark synonyms correct or incorrect on the basis of the question: “if a patient has [synonym], would it also be true that they have [original label]?” Entries were marked as ambiguous if the synonym was in a different language, or if the validator otherwise did not have enough knowledge of the phenotype to determine whether the synonym was correct.

Annotation

We used the Komenti semantic text mining framework, which uses Stanford CoreNLP’s RegexNER [38], to annotate 1,000 randomly sampled entries from the NOTEEVENTS table in MIMIC-III (MIMIC) [39]. MIMIC is a freely available healthcare database containing a variety of structured and unstructured information concerning around 60,000 admissions to critical care services [36]. We annotated the sample with all subclasses of Abnormality of the cardiovascular system (HP:0011025), comparing the number of annotations before and after synonym expansion. This investigation was performed on 17/01/2020.
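
Komenti’s internals are not reproduced here; the following standalone Groovy sketch illustrates RegexNER-based vocabulary annotation of the kind described. The two-entry vocabulary is illustrative, and the pipeline properties follow CoreNLP 3.x conventions (we assume version 3.9.2 here).

```groovy
@Grab('edu.stanford.nlp:stanford-corenlp:3.9.2')
@Grab(group = 'edu.stanford.nlp', module = 'stanford-corenlp',
      version = '3.9.2', classifier = 'models')
import edu.stanford.nlp.pipeline.CoreDocument
import edu.stanford.nlp.pipeline.StanfordCoreNLP

// Build a RegexNER mapping file from a vocabulary: one tab-separated
// "pattern<TAB>TAG" line per label or synonym (vocabulary is illustrative).
def vocab = ['hypertension'       : 'HP:0000822',
             'high blood pressure': 'HP:0000822']
def mapping = File.createTempFile('vocab', '.tsv')
vocab.each { label, cls -> mapping << "${label}\t${cls}\n" }

def props = new Properties()
props.setProperty('annotators', 'tokenize, ssplit, pos, lemma, ner, regexner')
props.setProperty('regexner.mapping', mapping.absolutePath)
props.setProperty('regexner.ignorecase', 'true')
def pipeline = new StanfordCoreNLP(props)

def doc = new CoreDocument('Patient has a history of high blood pressure.')
pipeline.annotate(doc)

// Print tokens tagged with one of our vocabulary classes.
doc.tokens().findAll { it.ner() in vocab.values() }
            .each { println "${it.word()} -> ${it.ner()}" }
```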

Patient characterisation

We sampled 1,000 patient visits from MIMIC-III (distinct from those used in the annotation experiment). We then concatenated all text records for each patient visit from the NOTEEVENTS table into one text file, and pre-processed the text to remove newlines, improve sentence delineation, and lemmatise words. We also retained the primary diagnosis, which was the first listed ICD-9 code in the DIAGNOSES_ICD table. These codes are produced by clinical coding specialists through examination of the texts associated with the visit.

We limited the classes considered for our annotation vocabulary to those for which DO contained a database cross-reference to ICD-9, of which there were 2,118. This reduced noise from terms not represented in ICD-9. We obtained the unexpanded and expanded synonyms for these terms on 08/07/2020. Both sets of labels were also lemmatised (both lemmatised and unlemmatised forms were used for annotation).

The Komenti semantic text-mining framework was used to annotate the text associated with each patient visit. As before, this made use of the CoreNLP RegexNER annotator [38]. Negated annotations were excluded using the komenti-negation algorithm [43]. We then used the set of terms associated with each patient visit to produce a semantic similarity matrix, using the Resnik measure of pairwise similarity for each annotated term [10], normalised into a groupwise measure using the best match average method [9]. Information content was calculated using the probability of the term appearing as an annotation in the totality of the set of annotations [10]. The similarity matrix was computed using the Semantic Measures Library [44].
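
The similarity computation itself was performed with the Semantic Measures Library; to make the measures explicit, the following is a minimal Groovy re-implementation over a toy ontology and toy visit annotations. All names and values here are illustrative, not the study’s data.

```groovy
// Resnik pairwise similarity aggregated with best-match-average (BMA).
// 'ancestors' maps each term to its subsumers (including itself); the
// information content of a term is IC(t) = -log p(t), where p(t) counts
// annotations of t or of any term t subsumes, over all annotations.
def ancestors = [
  hypertension               : ['hypertension', 'cardiovascular_abnormality'] as Set,
  bradycardia                : ['bradycardia', 'cardiovascular_abnormality'] as Set,
  cardiovascular_abnormality : ['cardiovascular_abnormality'] as Set,
]
def visits = [v1: ['hypertension'], v2: ['bradycardia'], v3: ['hypertension', 'bradycardia']]

double total = visits.values().sum { it.size() }
def ic = { t -> -Math.log(visits.values().sum { ann -> ann.count { a -> t in ancestors[a] } } / total) }

// Resnik: IC of the most informative common ancestor of the two terms.
def resnik = { t1, t2 -> ancestors[t1].intersect(ancestors[t2]).collect(ic).max() ?: 0.0 }

// BMA: average each term's best match in the other set, then symmetrise.
def directed = { from, to -> from.sum { x -> to.collect { y -> resnik(x, y) }.max() } / from.size() }
def bma = { a, b -> (directed(a, b) + directed(b, a)) / 2 }

println bma(visits.v1, visits.v3) // similarity between two annotated visits
```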

We evaluated the similarity matrix using mean reciprocal rank and mean average precision to measure performance in predicting shared primary patient diagnosis. A pair of patient visits was considered a true case if they had the same primary diagnosis (as per the MIMIC-III database). For mean average precision, we considered only the 10 most similar patients for each patient. The p-value was calculated using the built-in wilcox.test function of R version 3.4.4 [45].
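
For clarity, the two ranking metrics can be sketched as follows. The toy similarity matrix and diagnoses are illustrative, and the average precision variant shown (averaging precision at each relevant rank among the top 10) is an assumption about the exact computation.

```groovy
// Mean reciprocal rank (MRR) and mean average precision (MAP) over a
// similarity matrix. 'sim' maps each visit to its similarity with every
// other visit; 'diagnosis' holds each visit's primary ICD-9 code.
def sim = [
  a: [c: 0.9, b: 0.5, d: 0.1],
  b: [a: 0.9, d: 0.4, c: 0.3],
]
def diagnosis = [a: '401.9', b: '401.9', c: '427.31', d: '038.9']

// For a visit, rank neighbours by similarity and mark shared diagnoses.
def rankedMatches = { visit ->
  sim[visit].sort { -it.value }.keySet().collect { diagnosis[it] == diagnosis[visit] }
}

// MRR: mean of 1/rank of the first neighbour sharing the primary diagnosis.
def mrr = sim.keySet().sum { v ->
  def r = rankedMatches(v).indexOf(true)
  r >= 0 ? 1.0 / (r + 1) : 0.0
} / sim.size()

// MAP over the 10 most similar visits, as in the evaluation.
def map = sim.keySet().sum { v ->
  def hits = 0
  def precisions = rankedMatches(v).take(10).withIndex().findResults { isMatch, i ->
    isMatch ? (++hits / (i + 1.0)) : null
  }
  precisions ? precisions.sum() / precisions.size() : 0.0
} / sim.size()

println "MRR=${mrr} MAP=${map}"
```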