The Vision and Challenges of the Gene Ontology

The overarching goal of the Gene Ontology (GO) Consortium is to provide researchers in biology and biomedicine with all current functional information concerning genes and the cellular context in which they act. When the GO was started in the 1990s, surprisingly little attention had been given to how functional information about genes was to be uniformly captured, structured in a computable form, and made accessible to biologists. Because knowledge of gene, protein, ncRNA, and molecular complex roles is continuously accumulating and changing, the GO needed to be a dynamic resource, accurately tracking ongoing research results over time. Here I describe the progress that has been made over the years towards this goal, and the work that still remains to be done for the GO Consortium to realize its goal of offering the most comprehensive and up-to-date resource for information on gene function.

agreement. Those of us building these data resources (including Amos Bairoch, Jonathan Bard, David Botstein, Michelle Gwinn, Minoru Kanehisa, Stan Letovsky, and Monica Riley) were avidly discussing what might be done. Biologists needed a way of making some sense of the information we were so diligently collecting about genes, both to locate information and to traverse across taxa.
Specifically one slightly obsessive biologist, Michael Ashburner, wanted to classify all fly genes and have the corresponding worm, mouse, human, and yeast groups use the same classification scheme (see ftp://ftp.ebi.ac.uk/pub/databases/edgp/misc/ashburner/fly_function_tree for an early example, and ftp://ftp.geneontology.org/pub/go/www/gene.ontology.discussion.shtml for the white paper as it was first publicly presented in 1998). That way, if he found a fly gene involved in a particular process, he could then ask what genes in other taxa are thought to be involved in the "same" process, and what insights can be gleaned from its counterparts. We needed a way to describe the attributes of gene products in a rigorous way that would enable biologists to roam the universe of genomes and biology, to explore: temporally and spatially characteristic expression patterns; the specific (often cellular) compartment where they acted; whether they were constitutive parts of particular cellular components and/or complexes; and their biochemical or physiological functions and activities. These are attributes of genes that are of great interest to all biologists. And in an ideal world all biological databases would agree on how such information can be made discoverable and comparable.

Desiderata (Principles) Circa 1996-1997 (Banbury & Les Treilles)
Two seminal workshops were organized in 1996 and 1997, largely devoted to discussing the need for agreement among the genomic resources on how semantic comparability should be achieved. The first of these was sponsored by the Banbury Center (organized by M. Ashburner, E. Harlow, P. Karp and J. Witkowski), and the second, on building genome databases, was sponsored by the Fondation des Treilles (organized by W.M. Gelbart and M. Ashburner). These meetings set the stage for the Gene Ontology Consortium by defining our working definitions and essential principles.
These axiomatic working definitions begin with "gene product": a physical object, typically associated with a gene or genes indirectly through transcription and translation (for proteins), affecting some biological process. Such things as proteins, ncRNAs, protein complexes, and so forth are all typical functional objects. These were the objects to be described. In turn the essential attributes of a gene product (its function, the process(es) it participates in, and the cellular location at which these occur) were also defined: function being a capability that a physical gene product carries as a potential, describing only what a gene product can do, without necessarily specifying where or when this usage actually occurs; process being a transformation that has a temporal aspect to it, even if virtually instantaneous, accomplished via one or more ordered assemblies of functions; and (originally) cellular component being an anatomical structure within the cell, a location in which a function or process occurs (since expanded to include extracellular space).
Following agreement on these basic definitions came the animated discussions on the desired (and required) characteristics for actual operations.

Essentials for the "Ontology"
The name "Gene Ontology" was originally a jest, but the joke was on us, as it turns out GO is indeed an ontology, at least in the computational sense, with the primary operational data structure now being OWL. Every attribute mandated at the outset has proven its worth and remains at the core of the GO. Some of these essential criteria are outlined here.
It was rapidly understood that unique identifiers were essential operationally. This allowed the collaborating resources to reference the ontology classes (terms) unambiguously and stably. Furthermore, by using a semantically meaningless identifier, as opposed to using the label as the identifier, we were free to change the label at any time, and to display different preferred labels for different communities. At the time this was a major difference compared to other frame-based systems such as "Ontolingua", or even the Web Ontology Language (OWL, which did not yet exist at the time), which used the label (name) as the identifier.
It was also determined that it would be essential for the GO terms to have a graphical relationship to each other, rather than the prevalent norm in biology at the time: a flat list of keywords used for tagging. The early, consciously simplistic, model that GO began with had only two relationship types: is_a and part_of. But it was recognized even then that more relationships would ultimately be required.
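The two design decisions above (opaque identifiers and a graph of typed relationships) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Term` class, the `ancestors` helper, and the `EX:` identifiers and their relationships are hypothetical and are not taken from the GO itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    term_id: str                       # stable, semantically meaningless key
    label: str                         # preferred label, free to change
    synonyms: List[str] = field(default_factory=list)
    # relationship type ("is_a" or "part_of") -> parent identifiers
    parents: Dict[str, List[str]] = field(default_factory=dict)


# Hypothetical mini-ontology; identifiers and edges are made up for illustration.
ONTOLOGY = {
    "EX:0000001": Term("EX:0000001", "cell"),
    "EX:0000002": Term("EX:0000002", "organelle",
                       parents={"part_of": ["EX:0000001"]}),
    "EX:0000003": Term("EX:0000003", "membrane-bounded organelle",
                       parents={"is_a": ["EX:0000002"]}),
    "EX:0000004": Term("EX:0000004", "mitochondrion",
                       parents={"is_a": ["EX:0000003"]}),
}


def ancestors(term_id: str, ontology: Dict[str, Term]) -> Set[str]:
    """Collect every identifier reachable by following any relationship upward."""
    seen: Set[str] = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent_ids in ontology[current].parents.values():
            for parent_id in parent_ids:
                if parent_id not in seen:
                    seen.add(parent_id)
                    stack.append(parent_id)
    return seen
```

Because every reference goes through `term_id`, changing `ONTOLOGY["EX:0000004"].label` leaves the graph and anything that points at it untouched, which is the property a flat keyword list lacks.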

Human Readable Definitions and Labels
The decision to make numerical identifiers the stable GO "object" had implications for the human readable labels. In addition, rather than attempting to convey all pertinent biological information by encoding it directly into the label, human readable definitions would provide the definitive meaning. Thus it is the definition, not the label, which defines an ontology class in GO. If a label changes it means nothing and there are no serious consequences. If a definition changes, such that the meaning of the class has changed, then this has obvious consequences for any gene product that was annotated to the original class. Thus the original class is made obsolete, with a reference to the new class as a suggestion, and the new class is given a new identifier.
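The difference between a label change and a definition change can be sketched as follows. The class and function names (`OntologyClass`, `relabel`, `redefine`) and the `consider` field are assumptions made for this illustration, not the GO's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OntologyClass:
    class_id: str
    label: str
    definition: str
    obsolete: bool = False
    consider: Optional[str] = None   # suggested replacement, if any


def relabel(cls: OntologyClass, new_label: str) -> None:
    """A label change: same identifier, same meaning, no consequences."""
    cls.label = new_label


def redefine(classes: Dict[str, OntologyClass], old_id: str,
             new_id: str, new_definition: str) -> OntologyClass:
    """A change of meaning: obsolete the old class and mint a new identifier."""
    old = classes[old_id]
    old.obsolete = True
    old.consider = new_id            # a suggestion only, not an automatic remapping
    classes[new_id] = OntologyClass(new_id, old.label, new_definition)
    return classes[new_id]
```

Existing annotations keep pointing at the obsolete identifier, so a curator can review each one before moving it to the suggested class.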
Another point that often needs to be clarified is that GO has nothing to do with nomenclature. The confusion arises because we are using (and often have to use) exactly the same words to describe both the product and its function. For example, "alcohol dehydrogenase" can describe what you can put in an Eppendorf tube (the gene product) or it can describe the function of this protein. There is, however, a formal difference: a "gene product" has (potentially) a many-to-many relationship with a "function." That is to say, there are many gene products that have the function "alcohol dehydrogenase" (and some of these may indeed be encoded by a gene with the name alcohol dehydrogenase, but many will not be). Moreover, a particular gene product may have both functions "alcohol dehydrogenase" and "acetaldehyde dismutase" and possibly more. Since GO's remit is describing functions and processes, nomenclature is irrelevant to its purpose.
Finally, the labels themselves are intended to be familiar to researchers using GO. Over the years some unfortunate "standardization" efforts have rendered terms non-user-friendly (for example, what researchers call a transcription factor is "sequence-specific DNA binding RNA polymerase II transcription factor activity" in the GO). The consequence is that both annotation and searching are made more error-prone and difficult because the familiar term that a biologist would instinctively use cannot be quickly located. The GO Consortium continues working to rectify these labeling issues, both by an effort to use familiar labels and through the judicious use of synonyms.
Multiple synonyms of different flavors are essential for allowing GO to deal with colloquialisms, community preferences, abbreviations, legacy names, the multiple ways of referring to chemical elements, capitalization, and all the possible variations that occur in natural language. Because our top priority was communication of biological knowledge, we needed GO to accommodate every individual researcher by speaking in their particular idiom.

Versioning the Ontology and Classes: The History of Changes
In 2000 we began to maintain a history of the ontology and of each term. Comprehensive snapshots of both the ontology and the annotations are taken on a monthly basis, enabling progress to be quantified and retrospective analyses to be carried out. Additionally, from the outset, date stamping and authorship for each class were captured. Originally, and currently, the form is rather rudimentary: (Modified|Added|Deleted|Split from |Merged with ) by firstnameorinitial,surname yymmdd. This early decision to support "micro-attribution" remains valid, but the form is gradually transitioning into a more modern approach through the development of online editing and annotation tools with authentication and authorization.
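A history entry in the rudimentary form quoted above could be parsed roughly as follows. This is a sketch under assumptions: that "Split from" and "Merged with" are followed by a single whitespace-free identifier, that fields are separated by single spaces, and that the two-digit year follows Python's `%y` convention. The name `J,Doe` used in the usage note is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: ACTION [TARGET] by FIRSTNAME_OR_INITIAL,SURNAME YYMMDD
HISTORY_RE = re.compile(
    r"^(?P<action>Modified|Added|Deleted|Split from|Merged with)"
    r"(?: (?P<target>\S+))?"                  # only expected after Split from / Merged with
    r" by (?P<first>[^,\s]+),(?P<surname>\S+)"
    r" (?P<date>\d{6})$"
)


def parse_history(line):
    """Return the fields of one history entry, or None if the line does not match."""
    match = HISTORY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "action": match["action"],
        "target": match["target"],
        "first": match["first"],
        "surname": match["surname"],
        # strptime raises ValueError for impossible dates such as month 13
        "date": datetime.strptime(match["date"], "%y%m%d").date(),
    }
```

For example, `parse_history("Added by J,Doe 000115")` yields the action `Added`, no target, and the date 15 January 2000. Free-text entries like these are easy to write but hard to validate, which is one reason to move towards authenticated editing tools.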
From the outset members of the community were asking for subsets of the GO containing only the major categories and subcategories, or a branch of relevance to their particular application. These "Slims" enable users to broadly group their gene products using a very limited set of broad categories, or confine themselves to specific branches dealing with a particular biological topic, or constrain the GO by a taxonomic criterion. "Slims" are handled internally by tagging the different GO classes as members of various categories. These GO subsets are used in multiple different ways: for high-level classification; for defining sub-branches at the finest granularity; for clade-specific versions; and as other utility subsets.
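Tagging classes as subset members makes it straightforward to roll a specific class up to its broad category. The sketch below assumes a simple parent map and a subset name `example_slim`; these, the `EX:` identifiers, and the `map_to_slim` helper are hypothetical.

```python
from collections import deque

# Hypothetical child -> parents map and subset tags, for illustration only.
PARENTS = {
    "EX:0000004": ["EX:0000003"],
    "EX:0000003": ["EX:0000002"],
    "EX:0000002": ["EX:0000001"],
    "EX:0000001": [],
}
SUBSETS = {
    "EX:0000001": {"example_slim"},
    "EX:0000002": {"example_slim"},
}


def map_to_slim(term_id, subset_name, parents=PARENTS, subsets=SUBSETS):
    """Return the nearest subset members at or above term_id on each upward path."""
    found = set()
    seen = set()
    queue = deque([term_id])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if subset_name in subsets.get(current, set()):
            found.add(current)
            continue                 # stop climbing once a slim class is reached
        queue.extend(parents.get(current, []))
    return found
```

In a graph with multiple parents this simple walk can return a slim class together with one of its own slim ancestors; a production mapping would need to decide how to treat such cases.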

Applying the GO
We determined that collaborating databases would be responsible for attributing any functional assignment to a source (e.g., a literature reference or computational analysis) and for indicating the type of evidence used by this attribution source. The initial set of "evidence codes" was primed from a short list. This enabled statements such as: "Publication NNN" asserted that "gene A" has "function XYZ" by inference from a "direct assay." Since this time evidence codes have developed into an autonomous ontology [1], discussed in Chap. 18 [2], but the principle remains the same: if you are asserting that something is true then you must provide the evidence (its general category and the published reference) for making this assertion.
GO did not arise from nothing. Like every technology it used what came before it. Furthermore, we wanted to give attribution to our predecessors and provide a migration path for anyone with legacy data that had utilized these prior vocabularies. This practice came out of our own need as well: as the ontology was being built up we wanted to track some of our original sources.
Another expressivity requirement was to allow assertions stating that a given GO class does not hold for a given gene product. Experimentalists often test for an expected function, with negative results. Rather than lose this information we needed to provide a solution that could convey such negative results. Hence we provide for qualifiers on the GO annotations.
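The annotation requirements described above (a source, an evidence category, and the option to record a negative result) can be sketched as a small record type. The `Annotation` class and its field names are assumptions for illustration and do not reproduce any GO file format; the placeholder values are the ones used in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    go_class: str
    reference: str        # the source making the assertion
    evidence: str         # general category of evidence
    negated: bool = False # qualifier for an experimentally negative result

    def __post_init__(self):
        # No assertion without both its source and its evidence category.
        if not self.reference or not self.evidence:
            raise ValueError("an annotation needs a reference and an evidence category")

    def statement(self) -> str:
        verb = "does not have" if self.negated else "has"
        return (f'"{self.reference}" asserts that "{self.gene_product}" {verb} '
                f'"{self.go_class}" by inference from "{self.evidence}"')
```

Keeping the negation as a qualifier on the annotation, rather than as a separate kind of record, means positive and negative findings share the same evidence requirements.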
As with most of the challenges facing the GO, we recognized the need for identifying classes that are taxon specific in the very early years (1996 or earlier). The solution finally fell into place when the taxon-constraint resource and corresponding web service were implemented (e.g., http://owlservices.berkeleybop.org/isClassApplicableForTaxon?format=txt&idstyle=obo&id=GO:0005737&taxid=NCBITaxon:131567) [5].
Following the precept of test early and often, the first annotation effort began at SGD in early 1999. Fly genes were already "annotated" because these were the seeds that GO grew from. The question was how well proto-GO, based on the needs of fly, would translate to another, very different, organism. An extremely simple tab-delimited annotation format was devised and the dialog began. Similarly the first automated pipeline, "love-at-first-sight", was developed by Mark Yandell in late 1999 [6, 7] to describe the genes of the newly completed fly and human genomes. It was straightforward inference based on BLAST alignments, but it provided a reasonable overview of the landscape. The response to these first efforts was overwhelmingly positive and adoption of GO very quickly accelerated.
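A query of the form shown in the example URL can be assembled programmatically. The sketch below only builds the URL string from the parameters visible in that example; the helper name `taxon_constraint_url` is hypothetical, nothing is known here about the service's response format, and the service may no longer be reachable at this address.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in the text; availability is not guaranteed.
BASE_URL = "http://owlservices.berkeleybop.org/isClassApplicableForTaxon"


def taxon_constraint_url(go_id, taxon_id, fmt="txt", idstyle="obo"):
    """Build the query URL asking whether go_id is applicable to taxon_id."""
    params = {"format": fmt, "idstyle": idstyle, "id": go_id, "taxid": taxon_id}
    # safe=":" keeps the colons in identifiers such as GO:0005737 unescaped
    return BASE_URL + "?" + urlencode(params, safe=":")
```

Calling `taxon_constraint_url("GO:0005737", "NCBITaxon:131567")` reproduces the example URL given above.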

Annotation
The GO project remains focused on providing an integrated data resource for functional information, both experimental (Chaps. 4 and 6 [8, 9]) and predicted (Chap. 5 [10]), for all known proteins, noncoding RNA sequences, and cellular components. In other words, carrying out comprehensive functional annotation is what drives the project, not the ontology itself. The ontology provides the biological model that serves as the conceptual scaffolding for the biological data. The Gene Ontology database currently contains over 5.2 million function annotations for almost 900,000 gene products (mostly proteins but also some noncoding RNAs). About 660,000 of these annotations are based on experimental results reported in the published literature, and the remainder are predictions derived from a variety of different methods. All of these are freely available for the community to use. That said, there is still considerable room for improvement. There was, and remains, a significant amount of accumulated knowledge to be captured. In particular for human, the annotation task is still more about capturing old data than capturing new data because an equivalent to a Model Organism Database does not exist. Until the day the GO catches up it will need to capture existing data in parallel with capturing more recent data to achieve the coverage it aims for.

Where We Stand Today
Based on the wide adoption by the community, we can claim that the project met a real need. The GO is a useful alternative to simple nomenclature, as nomenclature fails to fully convey the biology and is too limited to describe protein roles fully. There is still a long way ahead: several of the key elements that we recognized as essential in the nineties are still works in progress today.
In 1999 we decided at the first official GO meeting against implementing relationships across the three branches of the GO until a later time. Needless to say this drastically over-simplified the biological model, a simplification we were fully cognizant of but one that allowed us to prioritize our work. In this simplistic model with which GO began there were only two relationship types: is_a and part_of. And even here the meaning of part_of was conflated, since part_of in the cellular component branch of GO meant that it was a sub-component while part_of in BP meant a step or subprocess. Since that time we have continued to work on enriching the Relations Ontology and applying it appropriately (https://github.com/oborel/obo-relations). Currently there are eight relationships in use. Most significantly, the three branches of the GO are now being linked.

Orthogonality
We did not and do not want multiple "rival" ontologies for one domain. The initial necessity for embedding terms within other terms led to the creation of numerous implicit ontologies embedded within the GO (chemicals, anatomical parts, tissues, and cell types). In the early years, while we recognized that this might be dealt with by incorporating the unique identifier that refers to the full definition elsewhere, in practice this could not be reliably accomplished at that time and it is taking some time to remedy.
Work to rectify the situation began shortly after the turn of the century [11] and has given rise to a small set of core ontologies, which have been teased out of the GO and replaced by including the unique identifier for the new class as part of the logical definition of the GO class. The first exercise was replacing all implicit references to chemicals in the GO with explicit references to ChEBI classes [12]. Similarly the Cell Type ontology was derived from the GO [13-15] and, as an autonomous ontology, has proven its own value for other applications. Expression analyses and RNASeq experiments often draw their samples from particular cell types, and projects such as ENCODE [16] and FANTOM [17] are using the cell type ontology to indicate the source cell type for their data. In addition, there are coordinated efforts connecting the cell line ontology, used in cancer studies, to the cell type ontology to indicate the original cell type [18]. There is immense benefit to constructing any ontology from its most elemental components because it provides a connective route across the widest possible network of projects. For example, RNA expression data from a cancer study that used a particular cell line can be automatically connected to ENCODE RNA expression data from a normal cell type.
As regards anatomy, Jonathan Bard initially raised the question of how we might consider a common language for anatomy. It was clear that we needed a methodology for anatomical interoperability and querying data across our various organisms, not just for gene function, but ultimately for phenotypes as well. As with chemicals and cell types, a species-neutral anatomical ontology, Uberon, was extracted from GO, but it also incorporated existing anatomical ontologies (e.g., mouse, zebrafish, fossils), thereby creating bridges between them [19-21]. Beyond its use by GO, Uberon is connecting phenotype data, for example, from human (it is used for the logical definitions of the Human Phenotype Ontology) to mouse (likewise there are logical definitions underlying the Mouse Phenotype ontology) with direct applicability to human health research [22].
The challenge of comparability and interoperability can largely be overcome by community adoption of a small set of standard elemental core ontologies, from which special purpose ontologies, which meet the unique needs of a given project, can be constructed. It is hard to emphasize this enough. While the community seems to be blooming with a cacophony of idiosyncratic "ontologies", the GO is actively working to reduce the proliferation by deconstructing its terms into the elemental core set of conceptual classes needed to define its complex terms. This approach is producing enormous dividends in terms of interoperability and comparability across widely divergent data sets.

Contextual Annotation
The context in which a function is carried out was recognized from the outset as crucial. For example, the role of glucagon-mediated signal transduction in liver concerns gluconeogenesis, glycogenolysis and plasma glucose homeostasis, whereas the role of this process in adipose tissue is lipolysis. At the level of gene products, the role of cytochrome C is in oxidative phosphorylation and energy supply (when it is in the mitochondrion), and apoptosis (when it is in the cytoplasm). Capturing such context has proven operationally (that is, in terms of how easy it is for someone to annotate) to be one of our biggest challenges (see Chap. 17 [23] on annotation extensions). While annotation extensions have given curators a great deal more expressivity, they can still be improved upon, and developing new annotation strategies and methods is where GO is actively working.

What Lies Ahead
The fundamental motivation driving the GO has remained unchanged: we are attempting to build a realistic model of biology to enable research, based on the collective evidence gathered by the research community. As originally envisioned, we needed a way to describe the attributes of gene products in a rigorous way that would enable biologists to explore the universe of genomes and biology. As described above, we were cognizant of all these challenges from the start, and we are addressing them incrementally, taking advantage of technological advances as we go.
That said, the GO is predicated upon a reliable foundation of "annotation." To gather accumulated knowledge as well as keep up with new research requires us to continue to seek new, more efficient approaches for biologists to provide their data. This is one of our current big challenges. One approach is collaborative data exchange with other annotation initiatives. For example, our collaborations with Reactome (http://www.reactome.org/) and IntAct (http://www.ebi.ac.uk/intact/) allow data from these resources to be incorporated into GO. Another key strategy is community annotation, such as described in Chap. 7 [24], which has provided GO with additional annotations. Our future plans are to provide online community annotation tools, which will also be used by GO Consortium curators; these tools will also support refinement of the GO itself in addition to providing annotations.

Phylogenetic Annotation
Providing a resource that captures functional data for every extant protein is, to say the least, a formidable challenge. One obvious reason is that most sequences are not, nor ever will be, experimentally characterized (and not just because of volume, but also because some are experimentally intractable). Therefore most annotations must necessarily be based on predictions. Furthermore, for inferences to be as accurate as possible they should be predicated on an explicit evolutionary framework. For the past several years a small group of GO curators have been using an annotation tool, the Phylogenetic Annotation and INference Tool (PAINT) [25], to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (i.e., the experimentally supported GO annotations from the leaves of the tree and their phylogenetic relationship to an ancestral protein) for those assertions. PAINT is as yet a stand-alone desktop application, but work is underway to incorporate it into a suite of integrated, online annotation tools for GO curators and community contributors. Among the other tools in current development is one based on biological modules.
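The idea of asserting where a function was gained or lost on a family tree, and then inferring annotations for the descendants, can be sketched as below. This is not PAINT's implementation; the tree, the node names, the `EX:function1` identifier, and the `infer` helper are all hypothetical and only illustrate the propagation logic.

```python
# Hypothetical protein family tree: node -> children.
TREE = {
    "root": ["anc1", "leafC"],
    "anc1": ["leafA", "leafB"],
    "leafA": [],
    "leafB": [],
    "leafC": [],
}
# Curator assertions: functions gained or lost at specific nodes.
GAINED = {"anc1": {"EX:function1"}}
LOST = {"leafB": {"EX:function1"}}


def infer(node, tree, gained, lost, inherited=frozenset(), result=None):
    """Propagate functions from ancestors to leaves, applying gains and losses."""
    if result is None:
        result = {}
    # A node has what it inherited plus its gains, minus its losses.
    current = (set(inherited) | gained.get(node, set())) - lost.get(node, set())
    children = tree[node]
    if not children:
        result[node] = current       # only leaves (extant proteins) are reported
    for child in children:
        infer(child, tree, gained, lost, frozenset(current), result)
    return result
```

Here `infer("root", TREE, GAINED, LOST)` assigns `EX:function1` to `leafA` only: `leafB` lost it and `leafC` diverged before the gain.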

Modular Annotation
Biological systems are modular at many levels. For example, within a single domain a catalytic site may be coupled to an (allosteric) binding site that regulates the catalytic activity. Or, within a single protein, different domains may form a module, e.g., the ligand binding domain and protein kinase domain of a transmembrane protein kinase receptor. Further up the size scale are functional modules composed from subunits within a macromolecular complex (e.g., the ribosome). And, at an even higher level, molecular interactions can define a pathway that can be used or reused in multiple different processes (e.g., the ubiquitin-dependent proteolysis pathway or JAK-STAT pathway). The goal of this modular approach is to define each GO term through a combination of terms, and enable extensible representation of biological modularity: how elemental molecular interactions are combined in different ways to produce compound molecular functions, how molecular functions are combined to produce processes, and how processes are combined to produce larger processes. A first release of this curation tool (dubbed "Noctua", see footnote 3) is now being evaluated by GO curators. One notable feature of this new tool is that it combines the tasks of annotation and ontology construction. Historically the artificial disconnect between these two inseparable tasks created serious bottlenecks, as annotators were forced to wait for a separate group to create or modify requisite terms. With Noctua the curators will be more directly describing biology, with known relationships in the ontology associated with specific instances that support this model.

Summary
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. It is an ongoing enterprise as our understanding of biology grows and is refined. It is a computational model of biological reality that we ultimately hope every researcher will happily contribute to and regard as the optimum means of sharing the knowledge they have gained from their own research with the wider community.

3 Little owl (Athene noctua) is a bird that was sacred to the goddess Athena, the Greek goddess of wisdom.

Funding Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.