Primer on the Gene Ontology

The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination of solid conceptual underpinnings and a practical set of features has made the GO a widely adopted resource in the research community and an essential resource for data analysis. In this chapter, we provide a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain how to interpret annotations associated with the GO.


Introduction
The key motivation behind the Gene Ontology (GO) was the observation that similar genes often have conserved functions in different organisms (1). A common vocabulary was clearly needed to compare the roles of orthologous genes (and their products) across different species. The value of comparative studies of biological function across systems predates Jacques Monod's statement that "anything found to be true of E. coli must also be true of elephants" (2). The Gene Ontology aims to produce a rigorous shared vocabulary to describe the roles of genes across different organisms (1). The GO project consists of the Gene Ontology itself, which models biological aspects in a structured way, and annotations, which associate genes or gene products with terms from the Gene Ontology. Combining information from all organisms in one central repository makes it possible to integrate knowledge from different databases, to infer the functionality of newly discovered genes, and to gain insight into the conservation and divergence of biological subsystems.
In this primer, we review the fundamentals of the GO project. The chapter is organised as answers to five essential questions: What is the GO? Why use it? Who develops it and provides annotations? What are the elements of a GO annotation? And finally, how can the reader learn more about GO resources?

What is the Gene Ontology?
The Gene Ontology is a controlled vocabulary of terms to represent biology in a structured way. The terms are subdivided into three distinct ontologies that represent different biological aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC) (1). These ontologies are non-redundant and share a common space of identifiers and a well-specified syntax.
Terms are linked to each other by relations to form a hierarchical vocabulary (see the chapter by Janna Hastings). This is often modelled as a graph in which the terms are the nodes and the relationships form the directed edges (Figure 1). Since each term can have multiple relationships to broader parent terms and to more specific child terms, the structure allows for more expressivity than a simple hierarchy.
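The graph view described above can be made concrete with a small sketch. The term names below are placeholders rather than real GO identifiers, and the dictionary layout is only one of many possible ways to store the edges; the point is that a term with several parents has a set of ancestors that a simple tree could not express.

```python
from collections import deque

# Toy graph: child -> list of (relation, parent).
# Term names are placeholders, not real GO identifiers.
TOY_EDGES = {
    "root": [],
    "process_A": [("is_a", "root")],
    "process_B": [("is_a", "root")],
    "process_C": [("is_a", "process_A"), ("part_of", "process_B")],
}


def ancestors(term, edges):
    """Return every term reachable by following parent links upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relation, parent in edges[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("process_C", TOY_EDGES)))
# prints ['process_A', 'process_B', 'root']
```

Here "process_C" has two parents reached through different relation types, so it inherits ancestors along both paths.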
The full GO is large: in October 2015, the full ontology specification had 43835 terms, 73776 explicitly encoded is_a relationships, 7436 explicitly encoded part_of relationships, and 8263 explicitly encoded regulates, negatively_regulates or positively_regulates relationships. This level of detail is not necessary for all applications.
Many research groups that produce GO annotations for specific projects use a GO slim, which is a manually curated subset of the Gene Ontology containing general, high-level terms across all biological aspects. There are several GO slims, ranging from the Generic GO slim developed by the GO Consortium (http://geneontology.org/page/go-slim-and-subset-guide) to more specific ones, such as the ChEMBL Drug Target slim (http://wwwdev.ebi.ac.uk/chembl/target/browser).
Figure 1. Paths from the term "regulation of cell projection assembly," GO:0060491, to its root term. The GO is a directed graph with terms as nodes and relationships as edges; these relationships are either is_a, part_of, has_part, or regulates. In its basic representation, there should be no cycles in this graph, and we can therefore establish parent (more general) and child (more specific) terms (see Chap. XX, the chapter by Moni, for more details on the different representations). Note that it is possible for a term to have multiple parents. This figure is based on the visualization available from the AmiGO browser, generated on November 6, 2015 (3).
To keep up with the current state of knowledge, as well as to correct inaccuracies, the GO undergoes frequent revisions: changes of relationships between terms, addition of new terms, or term removal (obsoletion). Terms are never deleted from the ontology; instead, their status changes to obsolete and all relationships to the term are removed (4). Furthermore, the name itself is preceded by the word "obsolete" and the rationale for the obsoletion is typically found in the Comment field of the term. An example of an obsolete term is GO:0000005, "obsolete ribosomal chaperone activity." This MF GO term was made obsolete "because it refers to a class of gene products and a biological process rather than a molecular function". Changes to the relationships do not affect annotations, because annotations are associated with a given GO term regardless of its relationships to other terms within the GO. Obsoletion of a term, however, does affect the annotations associated with it: in some cases, the old term can be automatically replaced by a new term or a parent term; in others, the change is so substantial that the annotations must be manually reviewed.
However, these changes can affect the analyses done using the ontology. In articles or reports, it is good practice to provide the version of the file used for a particular analysis. In GO, the version number is the date the file was obtained from the GO site (GO files are updated daily).
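The two outcomes of obsoletion described above (automatic replacement versus manual review) can be sketched as follows. This is a minimal illustration that assumes the ontology is read from an OBO-style flat file in which obsolete terms carry an `is_obsolete: true` tag and, where a direct substitute exists, a `replaced_by:` tag. The stanzas and identifiers are invented for the example and do not correspond to real GO terms.

```python
# Invented stanzas in OBO-style syntax; the IDs are placeholders.
STANZAS = """
[Term]
id: GO:9999991
name: obsolete example activity
is_obsolete: true
replaced_by: GO:9999992

[Term]
id: GO:9999992
name: example activity

[Term]
id: GO:9999993
name: obsolete unclear activity
is_obsolete: true
"""


def parse_stanza(text):
    """Parse one [Term] stanza into a dict of tag -> list of values."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        tag, _, value = line.partition(": ")
        fields.setdefault(tag, []).append(value)
    return fields


def resolve(term_id, terms):
    """Follow replaced_by links until a non-obsolete term is reached.

    Returns None when an obsolete term has no automatic replacement,
    which signals that the annotation needs manual review.
    """
    seen = set()
    while term_id not in seen:
        seen.add(term_id)
        term = terms[term_id]
        if term.get("is_obsolete", ["false"])[0] != "true":
            return term_id
        replacements = term.get("replaced_by", [])
        if not replacements:
            return None
        term_id = replacements[0]
    return None  # guard against a cycle of replacements


terms = {}
for block in STANZAS.strip().split("\n\n"):
    fields = parse_stanza(block)
    terms[fields["id"][0]] = fields

print(resolve("GO:9999991", terms))  # prints GO:9999992
print(resolve("GO:9999993", terms))  # prints None
```

Recording the ontology file date alongside such a mapping keeps the analysis reproducible when terms change status later.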

Why use the Gene Ontology?
Because it provides a standardised vocabulary for describing gene and gene product functions and locations, the GO can be used to query a database for the function or cellular location of genes, or to search for genes that share characteristics (5). The hierarchical structure of the GO makes it possible to compare proteins annotated to different terms in the ontology, as long as the terms have relationships to each other. Terms located close together in the ontology graph (i.e., with few intermediate terms between them) tend to be semantically more similar than those further apart (see the chapter by Catia Pesquita on comparing terms).
The GO is frequently used to analyse the results of high-throughput experiments. One common use is to infer commonalities in the location or function of genes that are over- or under-expressed (4, 6) (see also the chapter by Sebastian Bauer). In functional profiling, the GO is used to determine which processes differ between sets of genes. This is done by using a likelihood-ratio test to determine whether GO terms are represented differently between the two gene sets (4).
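As a rough illustration of the likelihood-ratio idea mentioned above, the sketch below applies a G-test to a 2x2 table for a single GO term. The function name and the table layout are choices made for this example, not the procedure of any particular tool, and the counts in the usage lines are invented. The p-value uses the chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(G/2)).

```python
import math


def g_test_2x2(a, b, c, d):
    """Likelihood-ratio (G) test on a 2x2 contingency table.

    a: genes in set 1 annotated to the term, b: genes in set 1 not annotated
    c: genes in set 2 annotated to the term, d: genes in set 2 not annotated
    Returns (G, p_value), with p taken from a chi-square with 1 df.
    """
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if observed[i][j] > 0:  # empty cells contribute nothing
                g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


# Invented counts: the term is annotated to 40 of 100 genes in set 1
# but only 10 of 200 genes in set 2.
g_stat, p = g_test_2x2(40, 60, 10, 190)
print(f"G = {g_stat:.1f}, p = {p:.2e}")
```

In practice many terms are tested at once, so a multiple-testing correction would normally follow; that step is omitted here.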
Additionally, the GO can be used to infer the function of unannotated genes. Predicted genes with significant similarity to annotated genes can be assigned one or several of the functions of the characterized genes. Other methods, such as detecting the presence of specific protein domains, can also be used to assign GO terms (7, 8). This is discussed in Chap. XX (chapter by Cozzetto and Jones).
A wealth of tools (web-based services, standalone software, and programming interfaces) has been developed for applying the GO to various tasks.
While Gene Ontology resources facilitate powerful inferences and analyses, researchers using the GO should familiarise themselves with the structure of the ontology and also with the methods and assumptions behind the tools they use to ensure that their results are valid. Common pitfalls and remedies are detailed in Chap. XX (chapter by Gaudet and Dessimoz).

Who develops the GO and produces annotations?
The GO Consortium consists of a number of large databases working together to define standardised ontologies and provide annotations to the GO (9) . The groups that constitute the

What are the elements of a GO annotation?
This section describes the different elements composing an annotation and some important considerations about each of them. The annotation process from a curator standpoint is discussed in detail in the chapter by Gaudet and Poux ( cross-reference ).
Fundamentally, a GO annotation is the association of a gene product with a GO term. [...] that cardinality is either one or two. When cardinality is greater than 1, elements in the field are separated with a pipe character or with a comma; the former indicates 'OR' and the latter indicates 'AND'. The GO term assigned in column 5 is always the most specific GO term possible.
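The pipe and comma convention just described can be sketched in a few lines. The function name and the identifiers in the example are hypothetical; the sketch only shows how a multi-valued field separates into alternatives (OR) that each contain one or more jointly required elements (AND).

```python
def parse_multivalue(field):
    """Split a multi-valued field into OR-groups of AND-ed elements.

    Pipes separate alternatives (OR); commas join elements that apply
    together (AND). An empty field yields an empty list.
    """
    if not field:
        return []
    return [group.split(",") for group in field.split("|")]


# Hypothetical identifiers, for illustration only.
print(parse_multivalue("DB:0001|DB:0002,DB:0003"))
# prints [['DB:0001'], ['DB:0002', 'DB:0003']]
```

The result reads as "DB:0001, or else DB:0002 together with DB:0003".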

Annotation object
The annotation object is the entity associated with a GO term: a gene, a protein, a non-protein-coding RNA, a macromolecular complex, or another gene product. Seven fields of the GAF file specify the annotation object. Each annotation in the GO is associated with a database (field 1) and a database accession number (field 2) that together provide a unique identifier for the gene, the gene product, or the complex. For example, the protein record P00519 is a database object in the UniProtKB database (Figure 2). The database object symbol (field 3), the database object name (field 10), and the database object synonyms (field 11) provide additional information about the annotation object. The database object type (field 12) specifies whether the object being annotated is a gene or a gene product (e.g., protein or RNA).
The organism from which the annotation object is derived is captured as the NCBI taxon ID (taxon; field 13); the corresponding species name can be found at the NCBI taxonomy website. GO allows capturing isoform-specific data when appropriate; for example, UniProtKB accession numbers P00519-1 and P00519-2 are the isoform identifiers for isoforms 1 and 2 of P00519. In this case, the database ID still refers to the main isoform, and an isoform accession is included in the GAF file as the "Gene Product Form ID" (field 17).
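The field numbers given in this section can be turned into a small parser. The sketch below assumes a tab-separated line with exactly 17 columns, which is consistent with the highest field number mentioned in the text but should be checked against the file version in use. The class and function names are hypothetical, and the sample row is assembled for illustration (placeholder name and synonyms), not copied from a real annotation file.

```python
from dataclasses import dataclass


@dataclass
class AnnotationObject:
    db: str            # field 1: database
    db_object_id: str  # field 2: accession number
    symbol: str        # field 3: database object symbol
    name: str          # field 10: database object name
    synonyms: list     # field 11: database object synonyms
    object_type: str   # field 12: gene, protein, RNA, ...
    taxon: str         # field 13: NCBI taxon ID
    form_id: str       # field 17: Gene Product Form ID (isoform)


def parse_annotation_object(line):
    """Extract the annotation-object fields from one tab-separated row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 17:  # assumption: 17-column layout
        raise ValueError(f"expected 17 columns, got {len(cols)}")
    return AnnotationObject(
        db=cols[0],
        db_object_id=cols[1],
        symbol=cols[2],
        name=cols[9],
        synonyms=cols[10].split("|") if cols[10] else [],
        object_type=cols[11],
        taxon=cols[12],
        form_id=cols[16],
    )


# Illustrative row: only the annotation-object columns are filled in.
columns = [""] * 17
columns[0] = "UniProtKB"
columns[1] = "P00519"
columns[2] = "ABL1"
columns[9] = "placeholder name"
columns[10] = "SYN1|SYN2"
columns[11] = "protein"
columns[12] = "taxon:9606"
columns[16] = "UniProtKB:P00519-2"

record = parse_annotation_object("\t".join(columns))
print(record.db, record.db_object_id, record.form_id)
# prints UniProtKB P00519 UniProtKB:P00519-2
```

Note that the database ID stays P00519 while the isoform is carried separately in the last column, mirroring the description above.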

GO term, annotation extension, and qualifier
Three fields are used to specify the function of the annotation object.

Evidence code and reference field
Three fields in the GAF file describe the evidence used to assert the annotation: the Reference (field 6), the Evidence Code (field 7), and the With/From (field 8). The Evidence Code indicates the type of experiment or analysis that supports the annotation. There are 21 evidence codes, which can be grouped in three broad categories: experimental annotations, curated non-experimental annotations, and automatically assigned (also known as electronic) annotations (Figure 3). The Reference field gives more details on the source of the annotation. For example, when the evidence code denotes an experimentally supported annotation, the Reference will contain the PubMed accession ID (or a DOI if no PubMed ID is available) of the journal article that underpins the annotation, or a GO_REF identifier that refers to a short description of the assignment method, accessible on the GO website. When the evidence code denotes an automatically assigned annotation, i.e. IEA, the Reference will contain a GO_REF identifier that specifies more details on the automatic assignment, e.g., annotation via the InterPro resource (20).

Experimentally supported annotations
Annotations based on direct experimental evidence found in the primary literature are denoted with the general evidence code EXP (Inferred from Experiment) or, when appropriate, the more specific evidence codes IDA (Inferred from Direct Assay), IPI (Inferred from Physical Interaction), IMP (Inferred from Mutant Phenotype), IGI (Inferred from Genetic Interaction), and IEP (Inferred from Expression Pattern) (Figure 3). These annotations are held in high regard by the community (e.g., 21) and are often used in applications such as checking the enrichment of a gene set in particular functions, finding genes that perform a specific function, or assessing involvement in specific pathways or processes.
Another important use of experimentally supported annotations is in providing trustworthy training sets for various computational methods that infer function (22). Used this way, the experimentally supported annotations can be amplified to understand more of the growing set of newly sequenced genes.

Curated non-experimental annotations
[...] that it is a prediction based on evolutionary neighbors. Finally, negative annotations can be assigned to highly divergent sequences using the code IRD (Inferred from Rapid Divergence).
RCA (Inferred from Reviewed Computational Analysis) captures annotations derived from predictions based on computational analyses of large-scale experimental data sets, or based on computational analyses that integrate datasets of several types, including experimental data (e.g., expression data, protein-protein interaction data, genetic interaction data), sequence data (e.g., promoter sequence, sequence-based structural predictions), or mathematical models.
Next, there are two types of annotations derived from author statements. Traceable Author Statement (TAS) refers to papers in which the result is cited but the original evidence itself is not presented, such as review papers. By contrast, Non-traceable Author Statement (NAS) refers to a statement in a database entry or to statements in papers that cannot be traced to another paper.
The final two evidence codes for curated non-experimental annotations are IC (Inferred by Curator) and ND (No biological Data available). If an assignment of a GO term is made using the curator's expert knowledge, concluding from the context of the available data, but without any direct evidence available, the IC evidence code is used. For example, if a eukaryotic protein is annotated with the MF term "DNA ligase activity," the curator can assign the BP term "DNA ligation" and CC term "nucleus" with the evidence code IC.
The ND evidence code indicates that the function is currently unknown (i.e., that no characterization of the gene is currently available). Such an annotation is made to the root of the respective ontology to indicate which functional aspect is unknown. Hence, the ND evidence code allows users to distinguish between unannotated genes (for which the literature has not been completely reviewed and thus no GO annotation has been made) and uncharacterised genes (GO annotation with the ND code). Note that the ND code is also different from an annotation with the "NOT" qualifier (which indicates the absence of a particular function).

Automatically assigned annotations
The evidence code IEA (Inferred from Electronic Annotation) is used for all inferences made without human supervision, regardless of the method used. The IEA evidence code is by far the most frequently used evidence code. The guiding idea behind computational function annotation is that genes with similar sequences or structures are likely to be evolutionarily related; assuming they have largely kept their ancestral function, they might still have similar functional roles today. For an in-depth discussion of computational methods for GO function annotations, refer to Chap. XX (chapter by Cozzetto and Jones) or see (32).
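The three broad categories of evidence codes can be represented as a simple lookup, for instance to filter annotations before an analysis. In the sketch below, the experimental codes and the codes IRD, RCA, TAS, NAS, IC, ND, and IEA are taken from the text of this chapter. The remaining curated codes (ISS, ISO, ISA, ISM, IGC, IBA, IBD, IKR) are filled in from the GO evidence code documentation as an assumption and should be checked against the current list. The function and variable names are hypothetical.

```python
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "curated non-experimental": {
        # Sequence- and phylogeny-based codes (assumed, see lead-in).
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR",
        # Codes named in the chapter text.
        "IRD", "RCA", "TAS", "NAS", "IC", "ND",
    },
    "electronic": {"IEA"},
}


def category_of(code):
    """Return the broad category for an evidence code."""
    for name, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return name
    raise KeyError(f"unknown evidence code: {code}")


def keep_categories(annotations, wanted):
    """Keep (gene, term, code) tuples whose code falls in a wanted category."""
    return [a for a in annotations if category_of(a[2]) in wanted]


# Hypothetical annotations as (gene, term, evidence code) tuples.
sample = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term2", "IEA"),
    ("geneB", "term1", "TAS"),
]
print(keep_categories(sample, {"experimental"}))
# prints [('geneA', 'term1', 'IDA')]
```

The total number of codes in the lookup is 21, matching the count given earlier in this section.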

Additional considerations about evidence codes
Biases associated with the different evidence codes are discussed in the chapter by Gaudet and Dessimoz. Note that there is a more extensive Evidence and Conclusion Ontology (ECO; 33), formerly known as the "Evidence Code Ontology", presented in Chap. XXX. ECO is only partially implemented in the GO: ECO terms are displayed in the AmiGO browser, but they are not in the GAF file. However, all Evidence Codes used by the GO are also found in ECO.
There is a general assumption among the GO user community that annotations based on experiments are of higher quality than those generated electronically, but this has yet to be empirically demonstrated. Generally, annotations derived from automatic methods tend to be made to high-level terms, so they may have a lower information value, but they often withstand scrutiny. Conversely, experiments are sometimes overinterpreted (see the chapter by Gaudet and Poux) and can also contain inaccuracies.

Uniqueness of GO annotations (or lack thereof)
No two annotations can have the same combination of the following fields: gene/protein ID, GO term, evidence code, reference and isoform. Thus one gene can be annotated to the same term with more than one evidence code.
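The uniqueness rule above amounts to treating five fields as a composite key. The sketch below uses hypothetical dictionary keys and placeholder values to show that the same gene and term can legitimately appear twice with different evidence codes, while an exact repeat of all five fields is redundant.

```python
def annotation_key(annotation):
    """Fields that together must be unique, per the rule stated above."""
    return (
        annotation["object_id"],
        annotation["go_term"],
        annotation["evidence"],
        annotation["reference"],
        annotation["isoform"],
    )


def deduplicate(annotations):
    """Keep the first annotation seen for each key, preserving order."""
    seen = set()
    unique = []
    for annotation in annotations:
        key = annotation_key(annotation)
        if key not in seen:
            seen.add(key)
            unique.append(annotation)
    return unique


# Placeholder records, not real annotations.
base = {"object_id": "ACC1", "go_term": "TERM_A", "reference": "REF:1", "isoform": ""}
records = [
    {**base, "evidence": "IDA"},
    {**base, "evidence": "IMP"},  # same term, different evidence: kept
    {**base, "evidence": "IDA"},  # exact repeat of the first: dropped
]

print([r["evidence"] for r in deduplicate(records)])
# prints ['IDA', 'IMP']
```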
Most GO analyses are gene-based, and therefore it is important in such analyses to make sure the list of genes is non-redundant. However, annotations are often made to larger protein sets that include multiple proteins from the same gene. This is particularly evident in UniProt, which can contain distinct entries from the TrEMBL (unreviewed) portion of the database that do not necessarily represent biologically distinct proteins. The different entries for the same protein or gene are often annotated with identical GO terms, which can bias statistical analyses because some genes have many more entries than others. For instance, the set of human proteins in UniProt comprises over 70,000 entries, but there are only approximately 20,000 recognized human protein-coding genes (20,187 reviewed human proteins in UniProt release 2015_12). The GO Consortium has worked with UniProt as well as the Quest for Orthologs Consortium to develop "gene-centric" reference proteome lists (http://www.uniprot.org/proteomes/) that provide a single "canonical" UniProt entry for each protein-coding gene. These lists are available for many species, and we encourage users performing gene-centric GO analyses to use only the annotations for UniProt entries in these lists.
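The bias described above, and the effect of restricting an analysis to canonical entries, can be seen in a toy example. The accessions, term names, and the `canonical` set below are invented; in a real analysis the set would be read from a reference proteome list.

```python
from collections import Counter


def entries_per_term(annotations, canonical_ids=None):
    """Count distinct entries per GO term.

    If canonical_ids is given, only entries in that set are counted.
    """
    pairs = {
        (a["object_id"], a["go_term"])
        for a in annotations
        if canonical_ids is None or a["object_id"] in canonical_ids
    }
    return Counter(term for _object_id, term in pairs)


# Invented data: one gene is represented by three database entries,
# another gene by a single entry.
annotations = [
    {"object_id": "ACC1", "go_term": "TERM_A"},
    {"object_id": "ACC1b", "go_term": "TERM_A"},
    {"object_id": "ACC1c", "go_term": "TERM_A"},
    {"object_id": "ACC2", "go_term": "TERM_B"},
]
canonical = {"ACC1", "ACC2"}  # one canonical entry per gene

print(dict(entries_per_term(annotations)))
print(dict(entries_per_term(annotations, canonical)))
```

Without the filter, TERM_A appears three times as frequent as TERM_B even though each term is associated with a single gene; with the filter, both terms are counted once.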

How can I learn more about Gene Ontology resources?
Most of the topics introduced in this primer will be treated in more depth and nuance in later chapters. Part II focuses on the creation of GO function annotations: we cover in depth the two main strategies, namely manual extraction/curation from the literature and computational prediction. Part III describes the main strategies used to evaluate their predictive performance. Part IV covers practical uses of the GO annotations: we discuss how GO terms and GO annotations can be summed and compared, how enrichment in specific GO terms can be analyzed, and how the GO annotations can be visualized. For the advanced GO user, Part V discusses how the context of a GO annotation is recorded and goes beyond the Evidence Codes to describe how to capture more information on the source of an annotation.
We end with Part VI by going beyond GO: we present alternatives to GO for functional annotation; we show how a structured vocabulary is used in the context of controlled clinical terminologies; and we present how information from different structured vocabularies is integrated in one overarching resource.