Integrating Bio-ontologies and Controlled Clinical Terminologies: From Base Pairs to Bedside Phenotypes

Electronic Health Records (EHR) are inherently complex and diverse and cannot be readily integrated and analyzed. Analogous to the Gene Ontology, controlled clinical terminologies were created to facilitate the standardization and integration of medical concepts and knowledge and enable their subsequent use for translational research, official statistics and medical billing. This chapter will introduce several of the main controlled clinical terminologies used to record diagnoses, surgical procedures, laboratory results and medications. The discovery of novel therapeutic agents and treatments for rare or common diseases increasingly requires the integration of genotypic and phenotypic knowledge across different biomedical data sources. Mechanisms that facilitate this linkage, such as the Human Phenotype Ontology, are also discussed.


Introduction
We are arguably entering the era of data-driven, personalized medicine, where electronic health records are considered the transformational force for measuring and improving the quality of clinical care and accelerating the pace of biomedical research [1, 2]. Electronic Health Record (EHR) data, alternatively referred to as Electronic Medical Record (EMR) data, are broadly defined as electronic data that are generated, captured and collected as part of routine clinical care across primary, secondary, and tertiary health care settings. EHR data can be structured (i.e., recorded using clinical terminologies), semi-structured (e.g., laboratory test results), or unstructured (e.g., free text). EHR data present multiple opportunities that have the potential to transform medical practice and research across all stages of translation [3-6].
Health care is an intrinsically multidisciplinary process and the care of patients, even within a single clinical specialty, intimately involves clinicians from a diverse set of other specialties (e.g., physicians, surgeons, radiologists, pharmacologists). Patient interactions often occur within distinct health care settings: some diseases are almost exclusively managed in primary care while acute manifestations are usually treated in secondary care. For chronic conditions, such as cardiovascular diseases, patients may have multiple interactions within primary and secondary care, and undergo assessments and diagnostic tests across both settings over long periods of time. The volume of EHR data being digitally generated and collected is thus vast and rapidly expanding, but these data lack a common structure to facilitate their use, both for care across clinical settings and for research, auditing, and other administrative purposes.
The purpose of this chapter is to provide a brief introduction to clinical terminologies for capturing and representing different aspects of clinical care in electronic health records. Firstly, contemporary terminologies for recording diagnoses, surgical procedures, lab measurements, and medication are described. Secondly, the main applications and challenges of using clinical terminologies are set out. Lastly, a potential pathway for integrating clinical terminologies with biological ontologies is illustrated through a case study in breast cancer.

Controlled Clinical Terminologies
Similar to bio-ontologies, such as the Gene Ontology [7, 8], controlled clinical terminologies (Table 1) were created to facilitate the systematic capture, curation, and description of health care-related concepts encountered during clinical care [9]. These can include but are not limited to diagnoses, symptoms, anatomical terms of location, prescribed medications, medical tests, surgical procedures, and laboratory measurements. Clinical terminologies are considered the conceptual core of clinical information systems and an essential tool for facilitating clinical data integration and reuse amongst disparate data sources. Initiatives such as the Open Biomedical Ontologies Consortium (OBO) [10] were founded to coordinate their evolution and alignment and provide a set of guidelines for creating and maintaining them with the aim of establishing an ecosystem of interoperable entities.
Several systematic literature reviews provide in-depth detail on their different aspects and characteristics [11-16]. A brief description of some key terminologies is provided below.

Diagnoses
SNOMED-Clinical Terms (SNOMED-CT) [17, 18] contains representations for over 300,000 health care-related concepts and is designed to capture and represent patient data for clinical care. It consists of four primary components that define the structure of the recorded information: concepts, descriptions, relationships, and reference sets. Concepts are the basic unit for describing health care-related information and are uniquely identified, e.g., the Myocardial Infarction concept (id 22298006). Each concept has a unique Fully Specified Name, a list of Preferred Terms (e.g., Myocardial Infarction), and Synonyms (e.g., Heart attack, Cardiac infarction). Concepts are organized into an acyclic hierarchy of is-a relationships that enables multiple inheritance, i.e., a concept can have more than one parent; Myocardial Infarction, for example, is a child of Myocardial disease (id 57809008). SNOMED-CT contains terms for describing clinical findings, symptoms, diagnoses, procedures, medication, devices, and anatomical body structures. It provides a compositional syntax which allows multiple ontology terms to be combined in order to build composite terms that represent complex medical concepts, a process known as post-coordination. Significant variation exists internationally with regard to SNOMED-CT adoption and implementation [19] and its use for research or routine clinical care. In the UK National Health Service (NHS), SNOMED-CT has been designated to become the standard clinical terminology to be used across the entire health care system by 2020.
The International Statistical Classification of Diseases and Related Health Problems (ICD) is a statistical classification system maintained by the World Health Organization [20]. ICD encapsulates concepts for classifying diseases, signs and symptoms, abnormal investigation findings, complaints, interactions with the health care system, social circumstances, and external causes of injury or disease. It maps health conditions to corresponding generic categories together with specific variations, assigning each a designated alphanumeric code of up to six characters. Major categories are designed to include a set of similar diseases (e.g., in ICD-10, codes beginning with the letter "I" cover diseases of the circulatory system). It is currently the most widely used statistical classification system in the world, with many countries developing their own extensions and modifications tailored to their local health care system (e.g., ICD-9-CM used in the USA [21]).
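The is-a hierarchy with multiple inheritance described above for SNOMED-CT can be illustrated with a minimal Python sketch. Only the two concept ids quoted in the text (22298006 and 57809008) come from this chapter; the identifiers prefixed with EX- and all edges between concepts are hypothetical placeholders, not actual SNOMED-CT content.

```python
from collections import deque

# Toy is-a graph: child concept id -> set of parent concept ids.
# Only 22298006 and 57809008 are ids quoted in the text; the "EX-" ids
# and every edge below are hypothetical placeholders.
IS_A = {
    "22298006": {"57809008", "EX-ISCHEMIC"},  # Myocardial infarction
    "57809008": {"EX-HEART"},                 # Myocardial disease
    "EX-ISCHEMIC": {"EX-HEART"},              # placeholder second parent
    "EX-HEART": set(),                        # placeholder top-level concept
}


def ancestors(concept_id, is_a=IS_A):
    """Return every concept reachable by following is-a edges upward."""
    seen = set()
    queue = deque(is_a.get(concept_id, ()))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(is_a.get(parent, ()))
    return seen


def is_subtype_of(concept_id, candidate, is_a=IS_A):
    """True if `candidate` is an ancestor of `concept_id` in the hierarchy."""
    return candidate in ancestors(concept_id, is_a)
```

Because a concept may have several parents, a query such as "all patients with a myocardial disease" has to test each recorded concept against the full set of ancestors rather than a single parent chain.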
The primary use case of ICD is to abstract EHR data by assigning unique codes to diagnoses and procedures. This process is known as clinical coding and is performed, manually or algorithmically, by specialist staff according to a prespecified protocol. Coded data are then utilized for research [22], official statistics [23], medical billing, and health care resource planning.
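When coded data are aggregated for statistics, codes are typically rolled up to broader categories. The following Python sketch shows one way this could be done, assuming ICD-10 style codes (a letter and two digits, optionally followed by further characters). The chapter table is a deliberately small illustrative subset, and the function names are hypothetical.

```python
import re

# Illustrative subset of ICD-10 chapter ranges; not the full classification.
CHAPTERS = [
    ("A00", "B99", "Certain infectious and parasitic diseases"),
    ("C00", "D48", "Neoplasms"),
    ("I00", "I99", "Diseases of the circulatory system"),
    ("J00", "J99", "Diseases of the respiratory system"),
]

# Assumed code shape: three-character category, optional dot, optional suffix.
CODE_PATTERN = re.compile(r"^([A-Z][0-9]{2})(?:\.?([0-9A-Z]{1,3}))?$")


def category(code):
    """Return the three-character category of an ICD-10 style code."""
    match = CODE_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not an ICD-10 style code: {code!r}")
    return match.group(1)


def chapter(code):
    """Return the chapter title for a code, or None if outside the subset."""
    cat = category(code)
    for start, end, title in CHAPTERS:
        # Categories share one fixed format, so string comparison is enough.
        if start <= cat <= end:
            return title
    return None
```

For example, `chapter("I21.9")` resolves through the category `I21` to the circulatory chapter, whereas a code outside the subset returns `None` rather than a guess.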
Clinical terminologies are also used for describing surgical procedures, interventions, and investigations that patients undergo in hospitals during inpatient and outpatient interactions. In the USA, the American Medical Association maintains the Current Procedural Terminology (CPT) [24], and in the UK, the OPCS Classification of Interventions and Procedures version 4 (OPCS-4) [25] is used by the National Health Service. Both terminologies are used to convey information about procedures to physicians and clinical coders and are combined with diagnosis codes during the medical billing process.

Laboratory Measurements
Logical Observation Identifiers Names and Codes (LOINC) [26-28] is maintained by the Regenstrief Institute and used for describing medical laboratory observations. LOINC facilitates the exchange of information about laboratory tests and results between health care providers, laboratories, and public health agencies. LOINC terms correspond to a single test, panel, observation, or measurement and are uniquely identified by a numeric code. Terms are formed of six parts: component (what is being measured), property (characteristics of what is being measured), time (measurement temporal information), system (observation context or specimen type), scale (scale of measure), and method (procedure used to obtain the measure).
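The six-part structure of a LOINC term maps naturally onto a small record type. The Python sketch below assumes the parts are supplied as a colon-separated string and that the method part may be absent. The class and function names are hypothetical, and the example code "00000-0" used in the note that follows is a placeholder rather than a real LOINC identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoincTerm:
    code: str                     # term identifier
    component: str                # what is being measured
    prop: str                     # property: characteristic being measured
    time: str                     # temporal information about the measurement
    system: str                   # observation context or specimen type
    scale: str                    # scale of measure
    method: Optional[str] = None  # procedure used to obtain the measure


def parse_term(code, name):
    """Split a colon-separated six-part name into a LoincTerm.

    Assumes the order component:property:time:system:scale[:method].
    """
    parts = name.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 parts, got {len(parts)}: {name!r}")
    method = parts[5] if len(parts) == 6 and parts[5] else None
    return LoincTerm(code, *parts[:5], method)
```

A call such as `parse_term("00000-0", "Glucose:MCnc:Pt:Ser/Plas:Qn")` yields a term whose component is `Glucose` and whose system is `Ser/Plas`, with no method recorded.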
RxNorm [29] is a US-specific terminology developed by the US National Library of Medicine for describing information about clinical drugs (defined as pharmaceutical products taken by patients with a therapeutic or diagnostic intent). It provides normalized names for all clinical drugs and links information about their active ingredient(s), strengths, form, and branded versions. RxNorm is widely used for recording drug information in patient health records, exchanging information between health care providers [30], personal medication records [31], and medication-related clinical decision support [32]. It also contains cross-references to other commonly used drug vocabularies.

Uses of Clinical Terminologies
While clinical terminologies are primarily used for the purposes of clinical data standardization and integration, the provision of a systematic and common language for describing health care concepts enables the subsequent use of EHR data for a diverse set of purposes, such as clinical research, auditing and billing. Adoption of clinical terminologies worldwide varies across health care settings and by purpose but diagnostic and procedural classification systems are primarily used for medical billing purposes. This section will briefly describe the opportunities and challenges of using EHR data and clinical terminologies.

Opportunities
EHR data are increasingly being linked and used for translational research [33] as they offer larger sample sizes at a higher clinical resolution [34]. A primary use-case of linked EHR data is to accurately extract phenotypic information (i.e., disease status), a process known as phenotyping [35]. Identifying cohorts of patients that share a common characteristic (e.g., have been diagnosed with hypertension or have abnormally high blood glucose measurements) enables researchers to use EHR data to perform large-scale clinical research studies at a lower cost compared to traditional bespoke investigator-led studies. EHR data have been used to examine disease aetiology in relation to clinical risk factors [36, 37] or genotypic information [38, 39], develop disease prognosis models [40], perform health outcome comparisons between countries [41], and facilitate pragmatic clinical trials [24]. Clinical terminologies are heavily used by deterministic rule-based algorithms curated by experts for identifying and constructing patient cohorts from raw EHR data but data-driven methodologies are increasingly being utilized [42]. Comprehensive reviews provide additional information on the use of clinical terminologies for other purposes such as annotating and accessing medical knowledge sources, data integration, semantic interoperability, data aggregation, and clinical decision support systems [43-46].
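A deterministic, rule-based phenotyping algorithm of the kind mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a validated phenotype definition: the record layout, the single-entry code list, the glucose cut-off, and the minimum number of readings are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    patient_id: str
    diagnosis_codes: list = field(default_factory=list)  # ICD-10 style codes
    glucose_mmol_l: list = field(default_factory=list)   # lab measurements

# Illustrative code list; a curated phenotype would use a reviewed list.
HYPERTENSION_PREFIXES = ("I10",)
# Hypothetical cut-off used only for this sketch.
GLUCOSE_THRESHOLD_MMOL_L = 11.1


def has_hypertension(record):
    """Rule 1: any recorded diagnosis code starts with a listed prefix."""
    return any(
        code.upper().startswith(HYPERTENSION_PREFIXES)
        for code in record.diagnosis_codes
    )


def has_high_glucose(record, threshold=GLUCOSE_THRESHOLD_MMOL_L, min_readings=2):
    """Rule 2: at least `min_readings` measurements at or above the threshold."""
    high = sum(1 for value in record.glucose_mmol_l if value >= threshold)
    return high >= min_readings


def build_cohort(records, rule):
    """Return the ids of all patients whose record satisfies `rule`."""
    return [r.patient_id for r in records if rule(r)]
```

Keeping each rule as a separate function makes the expert-curated logic easy to review and lets the same cohort builder be reused across phenotypes.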

Challenges
Merging EHR data across sources is challenging because of differences in the manner in which data are recorded. Each health care setting generates and records data for a particular purpose using the clinical terminology that is optimal in that specific context. For example, information in primary care can be recorded using SNOMED-CT whereas hospital morbidities would be recorded using ICD-10. This mismatch between the clinical terminologies used to record information leads to significant challenges as information is recorded at varying levels of granularity across sources. Semantic mapping systems, such as the Unified Medical Language System (UMLS) [47], can provide further details on the relationship between terms in each clinical terminology and facilitate the translation or integration of information across sources. However, direct one-to-one mappings might not always exist between terminologies, leading to information loss due to insufficient resolution, or to conflicts between two sources where multiple potential mappings exist. These issues and their severity vary by clinical speciality and context but often require a set of rules to be created by users and manually applied in order to resolve them before the data can be used for research purposes. In cases of incomplete mappings, synonyms or adjacent terms in the clinical terminology might be used as a replacement term, but this is assessed on a case-by-case basis.
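The mapping problems described here (missing mappings, multiple candidate mappings, and user-defined rules to resolve conflicts) can be made explicit in code. The Python sketch below uses entirely hypothetical source and target codes and does not reproduce any real mapping resource such as the UMLS.

```python
from enum import Enum


class MappingStatus(Enum):
    EXACT = "exact"          # exactly one target term
    RESOLVED = "resolved"    # several candidates, one chosen by a user rule
    AMBIGUOUS = "ambiguous"  # several candidates, no applicable rule
    UNMAPPED = "unmapped"    # no target term available


# Hypothetical cross-terminology map: source code -> candidate target codes.
CANDIDATES = {
    "SRC-001": ["TGT-A"],
    "SRC-002": ["TGT-B", "TGT-C"],
    "SRC-003": ["TGT-D", "TGT-E"],
}

# Hypothetical user-curated rules that settle known conflicts.
USER_RULES = {"SRC-002": "TGT-C"}


def translate(code, candidates=CANDIDATES, rules=USER_RULES):
    """Translate one source code, reporting how the result was obtained."""
    targets = candidates.get(code, [])
    if not targets:
        return MappingStatus.UNMAPPED, None
    if len(targets) == 1:
        return MappingStatus.EXACT, targets[0]
    chosen = rules.get(code)
    if chosen in targets:
        return MappingStatus.RESOLVED, chosen
    return MappingStatus.AMBIGUOUS, None
```

Returning a status alongside the translated code keeps unresolved cases visible, so that they can be reviewed case by case instead of being silently dropped.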

Integrating Biological and Clinical Data
A key challenge in genomics is to understand and elucidate the phenotypic consequences of variation observed at the genotypic level. Even among Mendelian diseases, the association between genotype and phenotype is often complex. With the advent of next-generation sequencing methods, the focus is now shifting from generating genomic sequence data to efficiently interpreting them.
From a clinical care perspective, diseases presented by patients can be phenotypically distinct and associated with a specific set of treatments, symptoms, investigative procedures and management strategies. From a molecular scientist's perspective, however, it might be appropriate to group and analyze diseases that share a common biological pathway as a single entity in order to discover similarities in the way they manifest in different patient groups. Both of these viewpoints are valid, but as a direct consequence, data describing phenotypic and molecular properties are recorded in a different, and often incompatible, manner [48]. The problem is exacerbated in rare diseases where researchers are required to create larger cohorts of patients by pooling data across research consortia in order to increase the sample sizes and obtain accurate estimates of risk.
Increasing amounts of molecular function knowledge are being recorded in a hierarchical manner using bio-ontologies such as the GO, which offer a rigid, machine-readable way to represent annotated knowledge that is interoperable between different data sources [11]. Scientists aim to link and integrate this with phenotypic information in order to elucidate the genotype-phenotype relationship and facilitate the discovery of novel therapeutic agents and treatments for common or rare disorders. Ontologies such as the Human Phenotype Ontology (HPO) [49, 50] and the Disease Ontology [51, 52] were created to provide streamlined disease definitions by systematically combining the diverse and heterogeneous knowledge contained within clinical terminologies and other annotation sources under a single framework. These tools aim to provide researchers with a rich resource that semantically links diverse disease definitions from clinical terminologies and enables the linking of phenotypic, genotypic and genetic information of a disease.

Human Phenotype Ontology
The HPO is a structured, curated ontology describing phenotypic abnormalities and the relationships between them. The HPO aims to act as scaffolding for enabling the interoperability between molecular biology and human disease by providing a centralized resource for integrating genotypic and phenotypic data across biomedical sources. The HPO enables the computational analysis of human (and model organism) phenotypes against the background biological and molecular knowledge incorporated in biological ontologies such as the GO.
The HPO is organized as three independent sub-ontologies that cover different domains, the largest of which describes phenotypic abnormalities. The other two sub-ontologies describe the mode of inheritance and the onset and clinical course of the abnormalities. The primary focus of the HPO is not to capture diseases but rather the phenotypic abnormalities that are associated with them. Each HPO term describes a phenotypic abnormality (e.g., Primary congenital glaucoma) and is assigned a unique persistent identifier (e.g., HP:0001087). HPO terms are related to parent terms by "is a" relationships and terms can have multiple parent terms. The HPO is not primarily designed to capture and document quantitative information (e.g., systolic blood pressure, body mass index) but does provide qualitative descriptions of excess or reduction in quantity leading to a phenotypic abnormality (e.g., markedly reduced T cell function).
Interoperability between molecular and phenotypic data and research areas is accomplished through a comprehensive set of term annotations. The majority of HPO terms contain a reference to the Unified Medical Language System [47], enabling the mapping of terms between controlled clinical terminologies and other sources in the UMLS Metathesaurus. Additionally, HPO terms contain annotations that provide pointers to specific diseases or genes curated in other external knowledge sources such as the Online Mendelian Inheritance in Man (OMIM) database (http://omim.org/), DECIPHER (https://decipher.sanger.ac.uk/), and Orphanet (http://www.orpha.net/). HPO annotations have a number of metadata fields associated with them for further specifying onset and frequency and for quantifying modifier effects. Annotation evidence codes, analogous to GO Evidence Codes, describe the manner in which a particular annotation was assigned to a term (e.g., inferred by text mining, traceable author statement, inferred from electronic annotation, public clinical study).
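The annotation model described above (a term linked to an external disease entry, with an evidence code and optional onset and frequency metadata) can be sketched as a simple record type with an evidence filter. All identifiers and values below are hypothetical placeholders, the field names are assumptions rather than the HPO annotation file format, and the evidence code abbreviations are used for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    term_id: str                     # phenotype term identifier
    disease_ref: str                 # pointer to an external disease entry
    evidence: str                    # how the annotation was assigned
    onset: Optional[str] = None      # optional metadata
    frequency: Optional[str] = None  # optional metadata


# Placeholder annotations; none of these are real HPO records.
ANNOTATIONS = [
    Annotation("HP:EX-1", "EX:DISEASE-1", "TAS", frequency="frequent"),
    Annotation("HP:EX-1", "EX:DISEASE-2", "IEA"),
    Annotation("HP:EX-2", "EX:DISEASE-1", "PCS", onset="adult"),
]


def diseases_for_term(term_id, annotations=ANNOTATIONS, accepted_evidence=None):
    """List diseases annotated with a term, optionally filtered by evidence."""
    return sorted({
        a.disease_ref
        for a in annotations
        if a.term_id == term_id
        and (accepted_evidence is None or a.evidence in accepted_evidence)
    })
```

Filtering on the evidence code lets an analysis keep, for instance, only curated statements and exclude annotations inferred electronically.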
Using malignant neoplasms of the breast as a hypothetical case study, this section presents a potential pathway for linking biological knowledge on genotypic variation and molecular functions to clinical phenotypes encountered within the health care system. Drilling down from clinical phenotypes on the right-hand side to genotypic variation on the left-hand side, Figure 1 illustrates the potential sources and the annotation mechanisms used within each source to capture and record information.
Genotypic information: HPO annotations provide a cross-link to the Online Mendelian Inheritance in Man (OMIM) Breast Cancer, Familial phenotype entity (OMIM #114480, URL www.omim.org/entry/114480). OMIM provides curated lists of disease phenotypes and genes associated with that phenotype, in this case for example the BRCA2 gene entry (OMIM *600185-www.omim.
Clinical phenotype: Oncology data in hospitals are stored in diverse locations and formats since diagnosis and treatment is a multidisciplinary process between pathology, radiology, surgery, medical oncology and radiotherapy. Breast cancer diagnosis and severity are usually evaluated through imaging tests such as mammograms, ultrasounds, and magnetic resonance imaging, or by performing a biopsy. Medical images and their associated metadata are stored in a picture archiving and communication system (PACS), and information about these procedures and the results obtained would be recorded using intervention and procedure terms. Diagnosis and staging information would be stored and coded in pathology systems using a medical terminology such as SNOMED-CT or other bespoke data structures. Treatment data would be stored in the pharmacy information systems.
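The pathway outlined in this case study, from a coded clinical diagnosis through the UMLS and the HPO to OMIM phenotype and gene entries, can be sketched as a chain of lookups in Python. Only the OMIM numbers 114480 and 600185 and the BRCA2 symbol come from the text. The ICD-10 category C50 is used here as the diagnosis code for malignant neoplasm of the breast, and the UMLS and HPO identifiers are explicit placeholders, not real identifiers.

```python
# Each table stands in for one linkage step; contents are illustrative.
ICD10_TO_UMLS = {"C50": "CUI-PLACEHOLDER"}          # placeholder UMLS concept
UMLS_TO_HPO = {"CUI-PLACEHOLDER": "HP:PLACEHOLDER"}  # placeholder HPO term
HPO_TO_OMIM_PHENOTYPE = {"HP:PLACEHOLDER": ["OMIM:114480"]}
OMIM_PHENOTYPE_TO_GENES = {"OMIM:114480": [("OMIM:600185", "BRCA2")]}


def genes_for_diagnosis(icd10_code):
    """Follow the chain diagnosis -> UMLS -> HPO -> OMIM phenotype -> genes."""
    category = icd10_code.strip().upper().split(".")[0]
    cui = ICD10_TO_UMLS.get(category)
    hpo_term = UMLS_TO_HPO.get(cui)
    genes = []
    for phenotype in HPO_TO_OMIM_PHENOTYPE.get(hpo_term, []):
        for _gene_entry, symbol in OMIM_PHENOTYPE_TO_GENES.get(phenotype, []):
            if symbol not in genes:
                genes.append(symbol)
    return genes
```

Any missing link in the chain simply produces an empty result, which mirrors the information loss that occurs when no mapping exists between two sources.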

Conclusion
The amount of clinical data generated and captured during routine clinical care is increasing in size and complexity. Integrating clinical data from disparate sources, however, is a challenging task due to their lack of common structure and annotation. Similar to the Gene Ontology, controlled clinical terminologies have been created to facilitate the systematic capture, curation, and description of health care-related events such as diagnoses, prescriptions and procedures from EHR data and enable their subsequent usage for clinical care, research, or administrative purposes. Furthermore, linking EHR data with biological knowledge is increasingly becoming possible through tools such as the Human Phenotype Ontology (HPO) and the Disease Ontology that aim to provide the semantic scaffolding for computationally integrating biomedical knowledge across sources.
Funding Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated. The images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.