Primer on Ontologies

As molecular biology has increasingly become a data-intensive discipline, ontologies have emerged as an essential computational tool to assist in the organisation, description and analysis of data. Ontologies describe and classify the entities of interest in a scientific domain in a computationally accessible fashion such that algorithms and tools can be developed around them. The technology that underlies ontologies has its roots in logic-based artificial intelligence, allowing for sophisticated automated inference and error detection. This chapter presents a general introduction to modern computational ontologies as they are used in biology.


Introduction
Examining aspects of the world to determine the nature of the entities that exist and their causal networks is at the heart of many scientific endeavours, including the modern biological sciences. Advances in technology have made it possible to perform large-scale high-throughput experiments, yielding results for thousands of genes or gene products in single experiments. The data from these experiments are growing in public repositories [1], and in many cases the bottleneck has moved from the generation of these data to the analysis thereof [2]. In addition to the sheer volume of data, as the focus has moved to the investigation of systems as a whole and their perturbations [3], it has become increasingly necessary to integrate data from a variety of disparate technologies, experiments, labs and even across disciplines. Natural language data description is not sufficient to ensure smooth data integration, as natural language allows for multiple words to mean the same thing, and single words to mean multiple things. In many cases the meaning of a natural language description is ambiguous. Ontologies have emerged as a key technology going beyond natural language in addressing these challenges.
The most successful biological ontology (bio-ontology) is the Gene Ontology (GO) [4], which is the subject of this volume.
Ontologies are computational structures that describe the entities and relationships of a domain of interest in a structured computable format, which allows for their use in multiple applications [5, 6]. At the heart of any ontology is a set of entities, also called classes, which are arranged into a hierarchy from the general to the specific. Additional information may be captured such as domain-relevant relationships between entities or even complex logical axioms. These entities that are contained in ontologies are then available for use as hubs around which data can be organised, indexed, aggregated and interpreted, across multiple different services, databases and applications [7].

Elements of Ontologies
Ontologies consist of several distinct elements, including classes, metadata, relationships, formats and axioms.
The class is the basic unit within an ontology, representing a type of thing in a domain of interest, for example carboxylic acid, heart, melanoma and apoptosis. Typically, classes are associated with a unique identifier within the ontology's namespace, for example (respectively) CHEBI:33575, FMA:7088, DOID:1909 and GO:0006915. Such identifiers are semantics free (they do not contain a reference to the class name or definition) in order to promote stability even as scientific knowledge and the accompanying ontology representation evolve. Ontology providers commit to maintaining identifiers for the long term, so that if they are used in annotations or other application contexts the user can rely on their resolution. In some cases as the ontology evolves, multiple entries may become merged into one, but in these cases alternate identifiers are still maintained as secondary identifiers. When a class is deemed to no longer be needed within the ontology it may be marked as obsolete, which then indicates that the ID should not be used in further annotations, although it is preserved for historical reasons. Obsolete classes may contain metadata pointing to one or more alternative classes that should be used instead.
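The identifier lifecycle described above (primary identifiers, secondary identifiers kept after a merge, and obsolete classes that point to alternatives) can be illustrated with a minimal sketch. The `OntologyClass` record, the `resolve` helper and the `EX:` identifiers are hypothetical names invented for this illustration; they do not correspond to any particular ontology or library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OntologyClass:
    """One ontology class with the lifecycle metadata described in the text."""
    id: str                                                # primary, semantics-free identifier
    name: str
    alt_ids: List[str] = field(default_factory=list)       # secondary identifiers kept after a merge
    obsolete: bool = False
    replaced_by: List[str] = field(default_factory=list)   # suggested alternatives if obsolete


def build_index(classes: List[OntologyClass]) -> Dict[str, OntologyClass]:
    """Map every primary and secondary identifier to its class record."""
    index: Dict[str, OntologyClass] = {}
    for cls in classes:
        index[cls.id] = cls
        for alt in cls.alt_ids:
            index[alt] = cls
    return index


def resolve(identifier: str, index: Dict[str, OntologyClass]) -> Optional[OntologyClass]:
    """Return the current class for an identifier, or None if no safe answer exists."""
    cls = index.get(identifier)
    if cls is None:
        return None
    if cls.obsolete:
        # Follow the pointer only when there is exactly one suggested replacement.
        if len(cls.replaced_by) == 1:
            return index.get(cls.replaced_by[0])
        return None
    return cls


# Hypothetical identifiers in an invented EX namespace.
classes = [
    OntologyClass("EX:0000001", "merged class", alt_ids=["EX:0000007"]),
    OntologyClass("EX:0000002", "retired class", obsolete=True, replaced_by=["EX:0000001"]),
]
index = build_index(classes)
print(resolve("EX:0000007", index).id)  # secondary identifier resolves to the merged class
print(resolve("EX:0000002", index).id)  # obsolete class resolves to its single replacement
```

Returning `None` when an obsolete class lists several alternatives is a design choice of this sketch: picking one automatically could silently change the meaning of an annotation, so the decision is left to a curator.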

Metadata
Classes are usually associated with annotated textual information, known as metadata. The metadata associated with classes may include any associated secondary (alternate) identifiers and flags to indicate whether the class has been marked as obsolete. It may also include one or more synonyms; for example the synonyms of apoptotic process (a class in the GO) include cell suicide, programmed cell death and apoptosis. It further may include cross references to that class in alternative databases and web resources. For example, many Chemical Entities of Biological Interest (ChEBI) [8] entries contain cross references to the KEGG resource [9], which represents those chemicals in the context of the biological pathways they participate in. Textual comments and examples of intended usage may be annotated. It is very important that each class include a clear definition, which provides enough information to pinpoint the meaning of the class and suggest its appropriate use, distinguishing the classes in an ontology sufficiently that a user can determine which is the best to use for annotation. The definition of apoptosis offered by the Gene Ontology is as follows: "A programmed cell death process which begins when a cell receives an internal (e.g. DNA damage) or external signal (e.g. an extracellular death ligand), and proceeds through a series of biochemical events (signaling pathway phase) which trigger an execution phase. The execution phase is the last step of an apoptotic process, and is typically characterized by rounding-up of the cell, retraction of pseudopodes, reduction of cellular volume (pyknosis), chromatin condensation, nuclear fragmentation (karyorrhexis), plasma membrane blebbing and fragmentation of the cell into apoptotic bodies. When the execution phase is completed, the cell has died."
Classes are arranged in a hierarchy from the general (high in the hierarchy) to the specific (low in the hierarchy). For example, in ChEBI carboxylic acid is classified as a carbon oxoacid, which in turn is classified as an oxoacid, which in turn is classified as a hydroxide, and so on up to the root chemical entity, which is the most general term in the structure-based classification branch of the ontology.
Despite the hierarchical organisation, most ontologies are not simple trees. Rather, they are structured as directed acyclic graphs. This is because it is possible for classes to have multiple parents in the classification hierarchy, and furthermore ontologies include additional types of relationships between entities other than hierarchical classification (which itself is represented by is_a relations). All relations are directed, and care must be taken by the ontology editors to ensure that the overall structure of the ontology does not contain cycles, as illustrated in Fig. 1.
Fig. 1 (a) A simple hierarchical tree, (b) a directed acyclic graph, (c) a graph that contains a cycle, indicated in red
A common relationship type used in multiple ontologies is part_of or has_part, representing composition or constitution. For example, in the Foundational Model of Anatomy (FMA) [10], heart has_part aortic valve. The Relation Ontology (RO) defines several relationship types that are commonly used across multiple bio-ontologies [11], a selection of which is shown in Table 1.
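The requirement that the relation graph contain no cycles can be checked mechanically. The following is a minimal sketch using a depth-first search with three node colours; the `find_cycle` function, the `(child, relation, parent)` edge layout and the single-letter class names are assumptions made for this illustration, not part of any ontology tool.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (child, relation, parent), e.g. ("D", "is_a", "B")


def find_cycle(edges: List[Edge]) -> List[str]:
    """Return one cycle as a list of nodes, or an empty list if the graph is acyclic."""
    graph: Dict[str, List[str]] = {}
    for child, _relation, parent in edges:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                # nxt is already on the current path, so the path loops back to it.
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return []

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []


# A class with two parents is still acyclic (a DAG, not a tree).
dag = [("D", "is_a", "B"), ("D", "is_a", "C"), ("B", "is_a", "A"), ("C", "is_a", "A")]
print(find_cycle(dag))                          # empty list: no cycle
print(find_cycle(dag + [("A", "is_a", "D")]))   # a list of nodes that starts and ends at the same class
```

The recursive search is adequate for small examples; an ontology with very deep hierarchies would call for an iterative traversal to stay within Python's recursion limit.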

Relations
In addition, specific ontologies may also include additional relationships that are particular to their domain. For example, GO includes biological process-specific relations such as regulates, while ChEBI includes chemistry-specific relationships such as is_tautomer_of and is_enantiomer_of.
The specification for a relationship type in an ontology includes a unique identifier, name and classification hierarchy, as for classes, as well as a specification of whether the relationship is reflexive (A rel A holds for every A), symmetric (A rel B if and only if B rel A) and/or transitive (if A rel B and B rel C then A rel C), and the name of the inverse relationship type if it exists. The same metadata as is associated with the classes in the ontology may also be associated with relationship types: alternative identifiers, synonyms, a definition and comments, and a flag to indicate if the relationship is obsolete.
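Declaring a relationship type as transitive is what allows software to derive statements that were never asserted directly. As a minimal sketch, the function below follows is_a links upwards to collect every ancestor of a class, using the ChEBI classification chain quoted earlier in this chapter. The `ancestors` function and the dictionary layout are assumptions of this illustration, and the chain is deliberately cut off at hydroxide.

```python
from typing import Dict, List, Set

# Direct is_a parents, taken from the ChEBI example in the text (truncated at hydroxide).
IS_A: Dict[str, List[str]] = {
    "carboxylic acid": ["carbon oxoacid"],
    "carbon oxoacid": ["oxoacid"],
    "oxoacid": ["hydroxide"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all classes reachable from term by repeatedly following a transitive relation."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Only one parent is asserted for carboxylic acid; the other two follow by transitivity.
print(sorted(ancestors("carboxylic acid", IS_A)))
```

The same traversal would be wrong for a relationship type that is not declared transitive, which is why the flag belongs in the specification of the relationship type and not in the consuming application.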
Table 1 A selection of relationship types defined in the RO
Relationship type | Informal meaning | Examples
--- | --- | ---
part_of | The standard relation of parthood. | A brain is part_of a body.
derives_from | Derivation holds between distinct entities when one succeeds the other across a temporal divide in such a way that a biologically significant portion of the matter of the earlier entity is inherited by the latter. | A zygote derives_from a sperm and an ovum.
has_participant | A relation that links processes to the entities that participate in them. | An apoptotic process has_participant a cell.
has_function | A relation that links material entities to their functions, e.g. the biological functions of macromolecules. | An enzyme has_function to catalyse a specific reaction type.

Typically, ontologies are stored in files conforming to a specific file format, although there are exceptions that are stored in custom-built infrastructures. Ontologies may be represented in different underlying ontology languages, and historically there has been an evolution of the capability of ontology languages towards greater logical expressivity and complexity, which is mirrored by the advances in computational capacity (hardware) and tools. Biological ontologies such as the GO have historically been represented in the human-readable Open Biomedical Ontologies (OBO) language, which was designed specifically for the structure and metadata content associated with bio-ontologies, but in recent years there has been a move towards the Semantic Web standard Web Ontology Language (OWL), largely due to the latter's adoption within a wider community and expansive tool support. Within OWL, specific standardised annotations are used to encode the metadata content of bio-ontologies as OWL annotations. However, the distinction has become cosmetic to some extent, as tools have been created which are able to interconvert between these languages [12], provided that certain constraints are adhered to.
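To give a feel for why the OBO language is described as human-readable, the sketch below reads a single hand-written term stanza into tag-value pairs. The stanza is an illustration assembled from the identifier, name and synonyms quoted in this chapter, not a copy of a released GO file; the synonym scopes are placeholders, and `parse_obo` is a hypothetical helper that handles only the simplest features of the format (stanza headers, `tag: value` lines and `!` comments).

```python
from collections import defaultdict
from typing import Dict, List

# Hand-written illustration of an OBO term stanza (scopes are placeholders).
OBO_TEXT = """\
format-version: 1.2

[Term]
id: GO:0006915
name: apoptotic process
synonym: "cell suicide" RELATED []
synonym: "apoptosis" RELATED []
is_a: GO:0012501 ! programmed cell death
"""


def parse_obo(text: str) -> List[Dict[str, List[str]]]:
    """Parse stanzas into dictionaries mapping each tag to its list of values."""
    stanzas: List[Dict[str, List[str]]] = []
    current = None
    for raw in text.splitlines():
        # Naive comment stripping: assumes '!' never appears inside a quoted value.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = defaultdict(list)
            current["stanza_type"].append(line[1:-1])
            stanzas.append(current)
        elif current is not None and ":" in line:
            tag, value = line.split(":", 1)   # split on the first colon only
            current[tag.strip()].append(value.strip())
        # Header lines before the first stanza are ignored in this sketch.
    return stanzas


term = parse_obo(OBO_TEXT)[0]
print(term["id"], term["name"], term["is_a"])
```

For real ontology files a maintained parser or an OWL toolkit is the safer choice, since the full format includes escaping rules, qualifiers and header semantics that this sketch ignores.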
Within logic-based languages such as OWL, statements in ontologies have a definite logical meaning within a set-based logical theory. Classes have instances as members, and logical axioms define constraints on class definitions that apply to all class members. For example, the statement carboxylic acid is_a carbon oxoacid has the logical meaning that all instances of carboxylic acid are also instances of carbon oxoacid: $\forall x\,[\mathrm{CarboxylicAcid}(x) \rightarrow \mathrm{CarbonOxoacid}(x)]$. The logical languages underlying ontology technology are collectively called Description Logics [13], in the plural because there are different variants with different levels of complexity. Some of the different ingredients of logical axioms that are available in the OWL language (quantification, cardinality, logical connectives and negation, disjointness and class equivalence) are explained in Table 2.
Like the carboxylic acid example above, each of these axiom types can be expressed as a logical statement. With these axioms, logic-based ontology reasoners are able to check for errors in an ontology. For example, if a class relation is quantified with 'only', such as the hydrocarbon example given in the table, then a class that contradicts that restriction will be detected as inconsistent and flagged as such by the reasoner.
The end result, an ontology which combines terminological knowledge with complex domain knowledge captured in logical form, is thus amenable to various sophisticated tools which are able to use the captured knowledge to check for errors, derive inferences and support analyses.
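The kind of error that an 'only' restriction lets a reasoner catch can be imitated with a toy check. The sketch below assumes a restriction of the form "hydrocarbon has_part only (carbon atom or hydrogen atom)" and a deliberately misclassified class; all class names, the dictionaries and the `violations` function are hypothetical. It compares asserted parts against the allowed set directly, which is far simpler than what a Description Logic reasoner such as those named in the next section does (a real reasoner works under open-world semantics and would also need the atom classes to be declared disjoint).

```python
from typing import Dict, Set

# Hypothetical toy ontology; "methanol" is placed under "hydrocarbon" on purpose.
PARENTS: Dict[str, Set[str]] = {
    "methane": {"hydrocarbon"},
    "methanol": {"hydrocarbon"},
}
HAS_PART: Dict[str, Set[str]] = {
    "methane": {"carbon atom", "hydrogen atom"},
    "methanol": {"carbon atom", "hydrogen atom", "oxygen atom"},
}
# Universal ("only") restriction: every has_part filler must come from the allowed set.
ONLY: Dict[str, Set[str]] = {"hydrocarbon": {"carbon atom", "hydrogen atom"}}


def violations(cls: str,
               parents: Dict[str, Set[str]],
               has_part: Dict[str, Set[str]],
               only: Dict[str, Set[str]]) -> Set[str]:
    """Return the parts of cls that break an 'only' restriction on cls or any ancestor."""
    bad: Set[str] = set()
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        allowed = only.get(current)
        if allowed is not None:
            # Restrictions are inherited, so the parts of cls are tested against each one.
            bad |= has_part.get(cls, set()) - allowed
        stack.extend(parents.get(current, ()))
    return bad


print(violations("methane", PARENTS, HAS_PART, ONLY))    # no violations
print(violations("methanol", PARENTS, HAS_PART, ONLY))   # the oxygen atom is flagged
```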

Tools
Developing a complex computational knowledge base such as a bio-ontology (for example, the Gene Ontology includes 43,980 classes) requires tool support at multiple levels to assist the human knowledge engineers (curators) with their monumental task. For editing ontologies, a commonly used freely available platform is Protégé [14]. Protégé allows the editing of all aspects of an ontology including classes and relationships, logical axioms (in the OWL language) and metadata. Protégé furthermore includes built-in support for the execution of automated reasoners to check for logical errors and for ontology visualisation using various different algorithms. Examples of reasoners that can be used within Protégé are HermiT [15] and FaCT++ [16]. For the rapid editing and construction of ontologies, various utilities are available, such as the creation of a large number of classes in a single 'wizard' step.
The software is open source and has a pluggable architecture, which allows for custom modular extensions. Protégé is able to open both OBO and OWL files, but it is designed primarily for the OWL language. An alternative editor specific to the OBO language is OBO-Edit [17]. Relative to Protégé, OBO-Edit offers more sophisticated metadata searching and a more intuitive user interface. To browse, search and navigate within a wide variety of bio-ontologies without installing any software or downloading any files, the BioPortal web platform provides an indispensable resource [18] that is especially important when using terminology from multiple ontologies. Additional browsing interfaces for multiple ontologies include the OLS [19] and OntoBee [20]. Most ontologies are also supported by one or more browsing interfaces specific to that single ontology, and for the Gene Ontology the most commonly used interfaces are AmiGO [21] and QuickGO [22].
Large-scale ontologies such as the GO and ChEBI are often additionally supported by custom-built software tailored to their specific use case, for example embedding the capability to create species-specific 'slims' (subsets of terms of the greatest interest within the ontology for a specific scenario) for the GO, or cheminformatics support for ChEBI. As ontologies are shared across communities of users, an important part of the tool support profile is tools for the community to provide feedback and to submit additional entries to the ontology.

Applications
The purposes that are supported by modern bio-ontologies are diverse. The most straightforward application of ontologies is to support the structured annotation of data in a database. Here, ontologies are used to provide unique, stable identifiers, associated to a controlled vocabulary, around which experimental data or manually captured reference information can be gathered [23]. An ontology annotation links a database entry or experimental result to an ontology class identifier, which, being independent of the single database or resource being annotated, is able to be shared across multiple contexts. Without such shared identifiers for biological entities, discrepant ways of referring to entities tend to accumulate (different key words, or synonyms, or variants of identifying labels), which significantly hinders reuse and integration of the relevant data in different contexts.
Secondly, ontologies can serve as a rich source of vocabulary for a domain of interest, providing a dictionary of names, synonyms and interrelationships, thereby facilitating text mining (the automated discovery of knowledge from text) [24], intelligent searching (such as automatic query expansion and synonym searching; an example is described in [25]) and unambiguous identification. When used in multiple independent contexts, such a common vocabulary can become additionally powerful. For example, uniting the representation of biological entities across different model organisms allows common annotations to be aggregated across species [26], which facilitates the translation of results from one organism into another in a fashion essential for the modern accumulation of knowledge in molecular biology. The use of a shared ontology also allows the comparison and translation of entities from one discipline to another, such as between biology and chemistry [27], enabling interdisciplinary tools that would be impossible computationally without a unified reference vocabulary.
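Automatic query expansion, mentioned above, can be sketched in a few lines: a search term is matched against the names and synonyms of a class, and the search is then run with all of that class's labels. The label list reuses the synonyms of apoptotic process given earlier in this chapter; the `expand_query` and `search` functions and the two example documents are hypothetical and exist only for this illustration.

```python
from typing import Dict, List

# Name and synonyms of one GO class, as quoted earlier in the chapter.
LABELS: Dict[str, List[str]] = {
    "GO:0006915": ["apoptotic process", "cell suicide", "programmed cell death", "apoptosis"],
}


def expand_query(query: str, labels: Dict[str, List[str]]) -> List[str]:
    """Return every label of the class that the query names, or the query itself if none matches."""
    wanted = query.strip().lower()
    for names in labels.values():
        if wanted in (name.lower() for name in names):
            return list(names)
    return [query]


def search(query: str, documents: List[str], labels: Dict[str, List[str]]) -> List[str]:
    """Return the documents that mention the query or any synonym of the class it names."""
    terms = [term.lower() for term in expand_query(query, labels)]
    return [doc for doc in documents if any(term in doc.lower() for term in terms)]


# Invented example sentences.
DOCUMENTS = [
    "Caspase activation precedes cell suicide in this model.",
    "Glycolysis rates were unchanged.",
]
hits = search("apoptosis", DOCUMENTS, LABELS)
print(hits)  # the first document is found although it never uses the word "apoptosis"
```

Plain substring matching is used here for brevity; a text-mining pipeline would normally tokenise the text and handle inflected forms.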
While the above applications would be possible even if ontologies consisted only of controlled vocabularies (standardised sets of vocabulary terms), the real power of ontologies comes with their hierarchical organisation and use of formal inter-entity relationships. Through the hierarchy of the ontology, it is possible to annotate data to the most specific applicable term but then to examine large-scale data in aggregate for patterns at the higher level categories. By centralising the hierarchical organisation in an application-independent ontology, different sources of data can be aggregated to converge as evidence for the same class-level inferences, and complex statistical tools can be built around knowledge bases of ontologies combined with their annotations, which check for over-representation or under-representation of given classes in the context of a given dataset relative to the background of everything that is known [28] (for more information see Chap. 13 [29]). The knowledge-based relationships captured in the ontology can be used to assign quantitative measures of similarity between entities that would otherwise lack a quantifiable comparative metric [30] (for more information see Chap. 12 [31]). And the relationships between entities can be used to power sophisticated knowledge-based reasoning, such as inferring the anatomical contexts to which organs, tissues and cells belong [32].
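Two of the ideas above, aggregating annotations at higher levels of the hierarchy and testing a class for over-representation, can be sketched together. The example propagates each gene's direct annotations to all ancestor classes and then computes a one-sided hypergeometric p-value, which is one common choice for such tests and is used here as an assumption, not as the method of any specific tool. The term and gene names are invented, and `propagate` and `enrichment_pvalue` are hypothetical helpers.

```python
from math import comb
from typing import Dict, Set

# Invented hierarchy and annotations.
PARENTS: Dict[str, Set[str]] = {"T:child_a": {"T:parent"}, "T:child_b": {"T:parent"}}
DIRECT: Dict[str, Set[str]] = {
    "g1": {"T:child_a"},
    "g2": {"T:child_b"},
    "g3": set(),
    "g4": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All classes reachable from term by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct: Dict[str, Set[str]], parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each term to the genes annotated to it directly or through a descendant."""
    by_term: Dict[str, Set[str]] = {}
    for gene, terms in direct.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                by_term.setdefault(t, set()).add(gene)
    return by_term


def enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


by_term = propagate(DIRECT, PARENTS)
study = {"g1", "g2"}                       # hypothetical gene set of interest
annotated = by_term["T:parent"]            # both genes, although neither is annotated directly
p = enrichment_pvalue(len(study & annotated), len(study), len(annotated), len(DIRECT))
print(sorted(annotated), p)
```

Neither child class alone covers both study genes, so the pattern only becomes visible at the parent class, which is the point made in the text. In practice many classes are tested at once and the resulting p-values need correction for multiple testing.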
With all these applications in mind, it is no wonder that the number and scope of bio-ontologies have been proliferating over the last decades. The OBO Foundry is a community organisation that offers a web portal in which participating ontologies are listed [33]. The web portal currently lists 137 ontologies, excluding obsolete records. Each of these ontologies has biological relevance and has agreed to abide by several community principles, including providing the ontology under an open license. Examples of these ontologies include ChEBI, the FMA, the Disease Ontology [34] and of course the Gene Ontology, which is the topic of this book. In the context of the OBO Foundry, different ontologies are now becoming interrelated through inter-ontology relationships [35], and where there are overlaps in content they are being resolved through community workshops.

Limitations
Ontologies are a powerful technology for encoding domain knowledge in computable form in order to drive a multitude of different applications. However, they are not one-stop solutions for all knowledge representation requirements. There are certain limitations to the type of knowledge they can encode and the ways that applications can make use of that encoded knowledge.
Firstly, it is important to bear in mind that ontologies are based on logic. They are good at representing statements that are either true or false (categorical), but they cannot elegantly represent knowledge that is vague, statistical or conditional [36]. Classes that derive their meaning from comparison to a dynamic or conditional group (e.g. the shortest person in the room, which may vary widely) are also not possible to represent well within ontologies. It can be difficult to adequately capture knowledge about change over time at the class level, i.e. classes in which the members participate in relationships at one time and not at another, as including a temporal index for each relation would require ternary relations, which neither the OBO nor the OWL language supports. Furthermore, although the underlying technology for representation and automated reasoning has advanced considerably in recent years, there are still pragmatic limits to ensure the scalability of the reasoning tools. For this reason, higher order logical statements, non-binary relationships and other complex logical constructs cannot yet be represented and reasoned with in most of the modern ontology languages.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.