1 Introduction

Examining aspects of the world to determine the nature of the entities that exist and their causal networks is at the heart of many scientific endeavours, including the modern biological sciences. Advances in technology have made it possible to perform large-scale high-throughput experiments, yielding results for thousands of genes or gene products in single experiments. The data from these experiments are growing in public repositories [1], and in many cases the bottleneck has moved from the generation of these data to the analysis thereof [2]. In addition to the sheer volume of data, as the focus has moved to the investigation of systems as a whole and their perturbations [3], it has become increasingly necessary to integrate data from a variety of disparate technologies, experiments, labs and even across disciplines. Natural language data description is not sufficient to ensure smooth data integration, as natural language allows for multiple words to mean the same thing, and single words to mean multiple things. There are many cases where the meaning of a natural language description is not fully unambiguous. Ontologies have emerged as a key technology going beyond natural language in addressing these challenges. The most successful biological ontology (bio-ontology) is the Gene Ontology (GO) [4], which is the subject of this volume.

Ontologies are computational structures that describe the entities and relationships of a domain of interest in a structured computable format, which allows for their use in multiple applications [5, 6]. At the heart of any ontology is a set of entities, also called classes, which are arranged into a hierarchy from the general to the specific. Additional information may be captured such as domain-relevant relationships between entities or even complex logical axioms. These entities that are contained in ontologies are then available for use as hubs around which data can be organised, indexed, aggregated and interpreted, across multiple different services, databases and applications [7].

2 Elements of Ontologies

Ontologies consist of several distinct elements, including classes, metadata, relationships, formats and axioms.

2.1 Classes

The class is the basic unit within an ontology, representing a type of thing in a domain of interest, for example carboxylic acid, heart, melanoma and apoptosis. Typically, classes are associated with a unique identifier within the ontology’s namespace, for example (respectively) CHEBI:33575, FMA:7088, DOID:1909 and GO:0006915. Such identifiers are semantics free (they do not contain a reference to the class name or definition) in order to promote stability even as scientific knowledge and the accompanying ontology representation evolve. Ontology providers commit to maintaining identifiers for the long term, so that if they are used in annotations or other application contexts the user can rely on their resolution. In some cases as the ontology evolves, multiple entries may become merged into one, but in these cases alternate identifiers are still maintained as secondary identifiers. When a class is deemed to no longer be needed within the ontology it may be marked as obsolete, which then indicates that the ID should not be used in further annotations, although it is preserved for historical reasons. Obsolete classes may contain metadata pointing to one or more alternative classes that should be used instead.

2.2 Metadata

Classes are usually associated with annotated textual information—metadata. The metadata associated with classes may include any associated secondary (alternate) identifiers and flags to indicate whether the class has been marked as obsolete. It may also include one or more synonyms; for example the synonyms of apoptotic process (a class in the GO) include cell suicide, programmed cell death and apoptosis. It further may include cross references to that class in alternative databases and web resources. For example, many Chemical Entities of Biological Interest (ChEBI) [8] entries contain cross references to the KEGG resource [9], which represents those chemicals in the context of the biological pathways they participate in. Textual comments and examples of intended usage may be annotated. It is very important that each class include a clear definition, which provides enough information to pinpoint the meaning of the class and suggest its appropriate use—sufficiently distinguishing different classes in an ontology so that a user can determine which is the best to use for annotation. The definition of apoptosis offered by the Gene Ontology is as follows:

A programmed cell death process which begins when a cell receives an internal (e.g. DNA damage) or external signal (e.g. an extracellular death ligand), and proceeds through a series of biochemical events (signaling pathway phase) which trigger an execution phase. The execution phase is the last step of an apoptotic process, and is typically characterized by rounding-up of the cell, retraction of pseudopodes, reduction of cellular volume (pyknosis), chromatin condensation, nuclear fragmentation (karyorrhexis), plasma membrane blebbing and fragmentation of the cell into apoptotic bodies. When the execution phase is completed, the cell has died.
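The identifier and metadata conventions described in Sects. 2.1 and 2.2 can be sketched as a small record type. This is a minimal illustration in Python, not the data model of any particular ontology tool: the field names, the `resolve` helper and the identifiers prefixed `EX:` are hypothetical, while the GO identifier, name and synonyms are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OntologyClass:
    """Illustrative record for one ontology class (field names are assumptions)."""
    id: str                                              # primary, semantics-free identifier
    name: str
    definition: str = ""
    synonyms: List[str] = field(default_factory=list)
    alt_ids: List[str] = field(default_factory=list)     # secondary identifiers kept after merges
    xrefs: List[str] = field(default_factory=list)       # cross references to other resources
    obsolete: bool = False
    replaced_by: Optional[str] = None                    # suggested alternative for an obsolete class


def resolve(identifier: str, classes: Dict[str, OntologyClass]) -> Optional[OntologyClass]:
    """Return the current class for a primary or secondary identifier.

    Obsolete classes are followed to their suggested replacement when one is
    recorded; otherwise the obsolete record itself is returned, so that old
    annotations can still be interpreted.
    """
    by_alt = {alt: c for c in classes.values() for alt in c.alt_ids}
    found = classes.get(identifier) or by_alt.get(identifier)
    seen = set()
    while found and found.obsolete and found.replaced_by and found.id not in seen:
        seen.add(found.id)                               # guard against replacement loops
        replacement = classes.get(found.replaced_by)
        if replacement is None:
            break
        found = replacement
    return found


apoptosis = OntologyClass(
    id="GO:0006915",
    name="apoptotic process",
    synonyms=["cell suicide", "programmed cell death", "apoptosis"],
    alt_ids=["EX:0000001"],                              # hypothetical secondary identifier
)
retired = OntologyClass(id="EX:0000002", name="retired term", obsolete=True,
                        replaced_by="GO:0006915")        # hypothetical obsolete class
classes = {c.id: c for c in (apoptosis, retired)}

print(resolve("EX:0000001", classes).name)   # secondary identifier resolves to the merged class
print(resolve("EX:0000002", classes).name)   # obsolete class is followed to its replacement
```

Keeping the secondary identifiers and the obsolescence flag on the record itself means that an annotation made against an old identifier can still be interpreted after the ontology has changed.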

2.3 Relations

Classes are arranged in a hierarchy from the general (high in the hierarchy) to the specific (low in the hierarchy). For example, in ChEBI carboxylic acid is classified as a carbon oxoacid, which in turn is classified as an oxoacid, which in turn is classified as a hydroxide, and so on up to the root chemical entity, which is the most general term in the structure-based classification branch of the ontology.

Despite the hierarchical organisation, most ontologies are not simple trees. Rather, they are structured as directed acyclic graphs. This is because it is possible for classes to have multiple parents in the classification hierarchy, and furthermore ontologies include additional types of relationships between entities other than hierarchical classification (which itself is represented by is_a relations). All relations are directed and care must be taken by the ontology editors to ensure that the overall structure of the ontology does not contain cycles, as illustrated in Fig. 1.

Fig. 1 (a) A simple hierarchical tree, (b) a directed acyclic graph, (c) a graph that contains a cycle, indicated in red
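Whether a set of directed relations contains a cycle can be checked mechanically. The following Python sketch uses a depth-first search; the function name `find_cycle` and the node labels are illustrative assumptions, and the recursive formulation is only suitable for small graphs.

```python
from typing import Dict, List


def find_cycle(edges: Dict[str, List[str]]) -> List[str]:
    """Return one cycle as a list of nodes, or [] if the graph is acyclic.

    `edges` maps each node to the nodes it points to (e.g. child -> parents).
    Nodes are coloured white (unvisited), grey (on the current path) or
    black (fully explored); reaching a grey node means a cycle exists.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}
    for targets in edges.values():
        for target in targets:
            colour.setdefault(target, WHITE)
    path: List[str] = []

    def visit(node: str) -> List[str]:
        colour[node] = GREY
        path.append(node)
        for nxt in edges.get(node, []):
            if colour[nxt] == GREY:                      # back edge: cycle found
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return []

    for node in list(colour):
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []


# A class with two parents: a directed acyclic graph, but not a tree.
dag = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
# The same kind of graph with an edge that closes a loop.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}

print(find_cycle(dag))
print(find_cycle(cyclic))
```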

A common relationship type used in multiple ontologies is part_of or has_part, representing composition or constitution. For example, in the Foundational Model of Anatomy (FMA) [10], heart has_part aortic valve. The Relation Ontology (RO) defines several relationship types that are commonly used across multiple bio-ontologies [11], a selection of which is shown in Table 1.

Table 1 A selection of relationship types commonly used in bio-ontologies

In addition, specific ontologies may also include additional relationships that are particular to their domain. For example, GO includes biological process-specific relations such as regulates, while ChEBI includes chemistry-specific relationships such as is_tautomer_of and is_enantiomer_of.

The specification for a relationship type in an ontology includes a unique identifier, name and classification hierarchy, as for classes, as well as a specification of whether the relationship is symmetric (i.e. A rel B if and only if B rel A) and/or transitive (if A rel B and B rel C then A rel C), and the name of the inverse relationship type if it exists. The same metadata as is associated with the classes in the ontology may also be associated with relationship types: alternative identifiers, synonyms, a definition and comments, and a flag to indicate if the relationship is obsolete.
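Declaring a relationship type as transitive is what allows software to follow chains of relations. As a minimal sketch, the Python function below collects everything reachable through a transitive relation, using the ChEBI is_a chain from the start of this section. Class names are used as keys for readability, whereas a real ontology would use identifiers, and the function name `ancestors` is an assumption.

```python
from typing import Dict, Set

# Direct is_a links from the ChEBI example above (child -> direct parents).
IS_A: Dict[str, Set[str]] = {
    "carboxylic acid": {"carbon oxoacid"},
    "carbon oxoacid": {"oxoacid"},
    "oxoacid": {"hydroxide"},
    "hydroxide": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All classes reachable from `term` by repeatedly following the relation."""
    result: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(parents.get(current, ()))
    return result


print(sorted(ancestors("carboxylic acid", IS_A)))
```

Because is_a is transitive, carboxylic acid is inferred to be an oxoacid and a hydroxide even though only the link to carbon oxoacid is asserted directly.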

2.4 Formats

Typically, ontologies are stored in files conforming to a specific file format, although there are exceptions that are stored in custom-built infrastructures. Ontologies may be represented in different underlying ontology languages, and historically there has been an evolution of the capability of ontology languages towards greater logical expressivity and complexity, which is mirrored by the advances in computational capacity (hardware) and tools. Biological ontologies such as the GO have historically been represented in the human-readable Open Biomedical Ontologies (OBO) language, which was designed specifically for the structure and metadata content associated with bio-ontologies, but in recent years there has been a move towards the Semantic Web standard Web Ontology Language (OWL), largely due to the latter’s adoption within a wider community and expansive tool support. Within OWL, specific standardised annotations are used to encode the metadata content of bio-ontologies. However, the distinction has become cosmetic to some extent, as tools have been created which are able to interconvert between these languages [12], provided that certain constraints are adhered to.
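To give a feel for the human-readable OBO style, the sketch below parses a single hand-written stanza of tag-value lines. The identifier, name and synonyms come from Sect. 2.2; the synonym scope keyword and the `parse_stanza` helper are assumptions made for illustration, and the stanza is not copied from a released GO file. Production code would normally use an existing parser library and would have to handle escaping, trailing modifiers and many more tags.

```python
from collections import defaultdict
from typing import Dict, List

# Hand-written stanza in the style of the OBO flat-file format. The scope
# keyword RELATED is a placeholder, not a claim about the released GO file.
STANZA = '''\
[Term]
id: GO:0006915
name: apoptotic process
synonym: "cell suicide" RELATED []
synonym: "programmed cell death" RELATED []
synonym: "apoptosis" RELATED []
'''


def parse_stanza(text: str) -> Dict[str, List[str]]:
    """Parse one stanza into a mapping from tag to list of values."""
    record: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):             # skip blanks and the [Term] header
            continue
        tag, _, value = line.partition(": ")
        record[tag].append(value.split(" ! ")[0])        # drop a trailing "! comment", if any
    return dict(record)


term = parse_stanza(STANZA)
print(term["id"], term["name"], len(term["synonym"]))
```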

2.5 Axioms

Within logic-based languages such as OWL, statements in ontologies have a definite logical meaning within a set-based logical theory. Classes have instances as members, and logical axioms define constraints on class definitions that apply to all class members. For example, the statement carboxylic acid is_a carbon oxoacid has the logical meaning that all instances of carboxylic acid are also instances of carbon oxoacid:

$$ \forall x : \mathit{CarboxylicAcid}(x) \rightarrow \mathit{CarbonOxoacid}(x) $$

The logical languages underlying ontology technology are collectively called Description Logics [13]—in the plural because there are different variants with different levels of complexity. Some of the different ingredients of logical axioms that are available in the OWL language—quantification, cardinality, logical connectives and negation, disjointness and class equivalence—are explained in Table 2.

Table 2 Logical constructs available in the OWL language

Like the carboxylic acid example above, each of these axiom types can be expressed as a logical statement. With these axioms, logic-based ontology reasoners are able to check for errors in an ontology. For example, if a class relation is quantified with ‘only’ such as the hydrocarbon example given in the table, which in logical language means

$$ \forall x\, \forall y : \left(\mathit{Hydrocarbon}(x) \wedge \mathit{hasPart}(x, y)\right) \rightarrow \left(\mathit{Hydrogen}(y) \vee \mathit{Carbon}(y)\right) $$

and then if a subclass of hydrocarbon in the ontology has a has_part relation with a target other than a hydrogen or a carbon (e.g. an oxygen):

$$ \mathit{Hydrocarbon}(a) \wedge \mathit{hasPart}(a, b) \wedge \mathit{Oxygen}(b) $$

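that class will be detected as inconsistent (unsatisfiable) and flagged as such by the reasoner, provided that the ontology also declares oxygen to be disjoint from hydrogen and carbon.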
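The effect of the ‘only’ restriction can be imitated with a toy check over individuals. The Python sketch below is not how a Description Logic reasoner works internally, since real reasoners operate on class-level axioms. It merely tests asserted has_part pairs against the allowed part types. All names are hypothetical, and giving each individual exactly one asserted type stands in for disjointness between the element classes.

```python
from typing import Dict, List, Set, Tuple

# hydrocarbon has_part only (hydrogen or carbon)
ALLOWED_PARTS: Set[str] = {"Hydrogen", "Carbon"}


def violations(has_part: List[Tuple[str, str]],
               type_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (whole, part) pairs that break the 'only' restriction."""
    return [
        (whole, part)
        for whole, part in has_part
        if type_of[whole] == "Hydrocarbon" and type_of[part] not in ALLOWED_PARTS
    ]


# Individual a is asserted to be a hydrocarbon with an oxygen part b.
type_of = {"a": "Hydrocarbon", "b": "Oxygen", "c": "Carbon"}
has_part = [("a", "b"), ("a", "c")]

print(violations(has_part, type_of))
```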

The end result—an ontology which combines terminological knowledge with complex domain knowledge captured in logical form—is thus amenable to various sophisticated tools which are able to use the captured knowledge to check for errors, derive inferences and support analyses.

3 Tools

Developing a complex computational knowledge base such as a bio-ontology (for example, the Gene Ontology includes 43,980 classes) requires tool support at multiple levels to assist the human knowledge engineers (curators) with their monumental task. For editing ontologies, a commonly used freely available platform is Protégé [14]. Protégé allows the editing of all aspects of an ontology including classes and relationships, logical axioms (in the OWL language) and metadata. Protégé furthermore includes built-in support for the execution of automated reasoners to check for logical errors and for ontology visualisation using various algorithms. Examples of reasoners that can be used within Protégé are HermiT [15] and FaCT++ [16]. For the rapid editing and construction of ontologies, various utilities are available, such as the creation of a large number of classes in a single ‘wizard’ step. The software is open source and has a pluggable architecture, which allows for custom modular extensions. Protégé is able to open both OBO and OWL files, but it is designed primarily for the OWL language. An alternative editor specific to the OBO language is OBO-Edit [17]. Relative to Protégé, OBO-Edit offers more sophisticated metadata searching and a more intuitive user interface.

To browse, search and navigate within a wide variety of bio-ontologies without installing any software or downloading any files, the BioPortal web platform provides an indispensable resource [18] that is especially important when using terminology from multiple ontologies. Additional browsing interfaces for multiple ontologies include the OLS [19] and OntoBee [20]. Most ontologies are also supported by one or more browsing interfaces specific to that single ontology, and for the Gene Ontology the most commonly used interfaces are AmiGO [21] and QuickGO [22].

Large-scale ontologies such as the GO and ChEBI are often additionally supported by custom-built software tailored to their specific use case, for example embedding the capability to create species-specific ‘slims’ (subsets of terms of the greatest interest within the ontology for a specific scenario) for the GO, or cheminformatics support for ChEBI. As ontologies are shared across communities of users, an important part of the tool support profile is tools for the community to provide feedback and to submit additional entries to the ontology.
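One plausible way to map annotations onto a slim is to climb from each annotated term until a slim term is reached. The sketch below illustrates this idea only: it is not the algorithm used by any GO tool, and the hierarchy, the identifiers and the `map_to_slim` helper are placeholders.

```python
from typing import Dict, Set

# Placeholder hierarchy (child -> direct parents); identifiers are hypothetical.
PARENTS: Dict[str, Set[str]] = {
    "EX:4": {"EX:2", "EX:3"},
    "EX:2": {"EX:1"},
    "EX:3": {"EX:1"},
    "EX:1": set(),
}
SLIM: Set[str] = {"EX:1", "EX:2"}


def map_to_slim(term: str, parents: Dict[str, Set[str]], slim: Set[str]) -> Set[str]:
    """Return the first slim term met on each upward path from `term`."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)                           # stop climbing along this path
        else:
            stack.extend(parents.get(current, ()))
    return found


print(sorted(map_to_slim("EX:4", PARENTS, SLIM)))
```

In this example both slim terms are returned because they are reached along different paths, even though one is an ancestor of the other. Whether such redundant hits should be pruned is a design decision for the application.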

4 Applications

The purposes that are supported by modern bio-ontologies are diverse. The most straightforward application of ontologies is to support the structured annotation of data in a database. Here, ontologies are used to provide unique, stable identifiers—associated to a controlled vocabulary—around which experimental data or manually captured reference information can be gathered [23]. An ontology annotation links a database entry or experimental result to an ontology class identifier, which, being independent of the single database or resource being annotated, is able to be shared across multiple contexts. Without such shared identifiers for biological entities, discrepant ways of referring to entities tend to accumulate—different key words, or synonyms, or variants of identifying labels—which significantly hinders reuse and integration of the relevant data in different contexts.

Secondly, ontologies can serve as a rich source of vocabulary for a domain of interest, providing a dictionary of names, synonyms and interrelationships, thereby facilitating text mining (the automated discovery of knowledge from text) [24], intelligent searching (such as automatic query expansion and synonym searching; an example is described in [25]) and unambiguous identification. When used in multiple independent contexts, such a common vocabulary can become additionally powerful. For example, uniting the representation of biological entities across different model organisms allows common annotations to be aggregated across species [26], which facilitates the translation of results from one organism into another in a fashion essential for the modern accumulation of knowledge in molecular biology. The use of a shared ontology also allows the comparison and translation of entities from one discipline to another, such as between biology and chemistry [27], enabling interdisciplinary tools that would be impossible computationally without a unified reference vocabulary.
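Synonym-based query expansion can be sketched in a few lines. The Python example below uses the names and synonyms of the GO class discussed in Sect. 2.2; the `expand_query` helper and the simple case-insensitive exact matching are assumptions, and a real search system would add tokenisation, ranking and hierarchy-aware expansion.

```python
from typing import Dict, List, Set

# Class name -> synonyms, taken from the apoptotic process example in Sect. 2.2.
SYNONYMS: Dict[str, List[str]] = {
    "apoptotic process": ["cell suicide", "programmed cell death", "apoptosis"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> Set[str]:
    """Return the query together with every label that shares a class with it."""
    q = query.lower()
    expanded = {q}
    for name, syns in synonyms.items():
        labels = {name.lower(), *(s.lower() for s in syns)}
        if q in labels:
            expanded |= labels
    return expanded


print(sorted(expand_query("Apoptosis", SYNONYMS)))
```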

While the above applications would be possible even if ontologies consisted only of controlled vocabularies (standardised sets of vocabulary terms), the real power of ontologies comes with their hierarchical organisation and use of formal inter-entity relationships. Through the hierarchy of the ontology, it is possible to annotate data to the most specific applicable term but then to examine large-scale data in aggregate for patterns at the higher level categories. By centralising the hierarchical organisation in an application-independent ontology, different sources of data can be aggregated to converge as evidence for the same class-level inferences, and complex statistical tools can be built around knowledge bases of ontologies combined with their annotations, which check for over-representation or under-representation of given classes in the context of a given dataset relative to the background of everything that is known [28] (for more information see Chap. 13 [29]). The knowledge-based relationships captured in the ontology can be used to assign quantitative measures of similarity between entities that would otherwise lack a quantifiable comparative metric [30] (for more information see Chap. 12 [31]). And the relationships between entities can be used to power sophisticated knowledge-based reasoning, such as inferring which organs, tissues and cells belong to which anatomical contexts [32].
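The aggregation and over-representation ideas above can be illustrated with a small sketch. It first propagates direct annotations up the hierarchy, then computes a one-sided hypergeometric p-value, which is one common choice of statistic for this kind of analysis and not necessarily the one used by the tools cited. The gene and class names are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb
from typing import Dict, Set


def propagate(direct: Dict[str, Set[str]],
              ancestors_of: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each class to the genes annotated to it or to any of its descendants."""
    by_class: Dict[str, Set[str]] = {}
    for gene, terms in direct.items():
        for term in terms:
            for cls in {term} | ancestors_of.get(term, set()):
                by_class.setdefault(cls, set()).add(gene)
    return by_class


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical data: two specific classes share one general parent class.
ancestors_of = {"child_a": {"parent"}, "child_b": {"parent"}}
direct = {"g1": {"child_a"}, "g2": {"child_b"}, "g3": {"other"}}
by_class = propagate(direct, ancestors_of)

study = {"g1", "g2"}                                     # genes of interest
annotated = by_class["parent"]
p = enrichment_p_value(k=len(study & annotated), n=len(study),
                       K=len(annotated), N=len(direct))
print(sorted(annotated), p)
```

Neither specific class is shared by the two genes of interest, but after propagation both count as evidence for the parent class, which is where the pattern becomes visible.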

With all these applications in mind, it is no wonder that the number and scope of bio-ontologies have been proliferating over the last decades. The OBO Foundry is a community organisation that offers a web portal in which participating ontologies are listed [33]. The web portal currently lists 137 ontologies, excluding obsolete records. Each of these ontologies has biological relevance and has agreed to abide by several community principles, including providing the ontology under an open license. Examples of these ontologies include ChEBI, the FMA, the Disease Ontology [34] and of course the Gene Ontology which is the topic of this book. In the context of the OBO Foundry, different ontologies are now becoming interrelated through inter-ontology relationships [35], and where there are overlaps in content they are being resolved through community workshops.

5 Limitations

Ontologies are a powerful technology for encoding domain knowledge in computable form in order to drive a multitude of different applications. However, they are not one-stop solutions for all knowledge representation requirements. There are certain limitations to the type of knowledge they can encode and the ways that applications can make use of that encoded knowledge.

Firstly, it is important to bear in mind that ontologies are based on logic. They are good at representing statements that are either true or false (categorical), but they cannot elegantly represent knowledge that is vague, statistical or conditional [36]. Classes that derive their meaning from comparison to a dynamic or conditional group (e.g. the shortest person in the room, which may vary widely) are also not possible to represent well within ontologies. It can be difficult to adequately capture knowledge about change over time at the class level, i.e. classes in which the members participate in relationships at one time and not at another, as including a temporal index for each relation would require ternary relations, which neither the OBO nor the OWL language supports.

Furthermore, although the underlying technology for representation and automated reasoning has advanced considerably in recent years, pragmatic limits remain in order to keep the reasoning tools scalable. For this reason, higher order logical statements, non-binary relationships and other complex logical constructs cannot yet be represented and reasoned with in most of the modern ontology languages.