Primer on the Gene Ontology
The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination of solid conceptual underpinnings and a practical set of features has made the GO a widely adopted resource in the research community and an essential resource for data analysis. In this chapter, we provide a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain how to interpret annotations associated with the GO.
Key words: Gene Ontology structure, Evidence codes, Annotations, Gene association file (GAF), GO files, Function, Vocabulary, Annotation evidence
The key motivation behind the Gene Ontology (GO) was the observation that similar genes often have conserved functions in different organisms. Clearly, a common vocabulary was needed to be able to compare the roles of orthologous genes (and their products) across different species. The value of comparative studies of biological function across systems predates Jacques Monod’s statement that “anything found to be true of E. coli must also be true of elephants”. The Gene Ontology aims to produce a rigorous shared vocabulary to describe the roles of genes across different organisms. The GO project consists of the Gene Ontology itself, which models biological aspects in a structured way, and annotations, which associate genes or gene products with terms from the Gene Ontology. Combining information from all organisms in one central repository makes it possible to integrate knowledge from different databases, to infer the functionality of newly discovered genes, and to gain insight into the conservation and divergence of biological subsystems.
In this primer, we review the fundamentals of the GO project. The chapter is organised as answers to five essential questions: What is the GO? Why use it? Who develops it and provides annotations? What are the elements of a GO annotation? And finally, how can the reader learn more about GO resources?
2 What Is the Gene Ontology?
The Gene Ontology is a controlled vocabulary of terms to represent biology in a structured way. The terms are subdivided into three distinct ontologies that represent different biological aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). These ontologies are non-redundant and share a common space of identifiers and a well-specified syntax.
The full GO is large: in October 2015, the full ontology specification had 43835 terms, 73776 explicitly encoded is_a relationships, 7436 explicitly encoded part_of relationships, and 8263 explicitly encoded regulates, negatively_regulates, or positively_regulates relationships. This level of detail is not necessary for all applications. Many research groups that produce GO annotations for specific projects use the generic GO slim file, a manually curated subset of the Gene Ontology containing general, high-level terms across all biological aspects. There are several GO slims, ranging from the general Generic GO slim developed by the GO Consortium to more specific ones, such as the ChEMBL Drug Target slim.
To keep up with the current state of knowledge, as well as to correct inaccuracies, the GO undergoes frequent revisions: changes of relationships between terms, addition of new terms, or term removal (obsoletion). Terms are never deleted from the ontology, but their status changes to obsolete and all relationships to the term are removed. Furthermore, the name itself is preceded by the word “obsolete” and the rationale for the obsoletion is typically found in the Comment field of the term. An example of an obsolete term is GO:0000005, “obsolete ribosomal chaperone activity”. This MF GO term was made obsolete “because it refers to a class of gene products and a biological process rather than a molecular function”. Changes to the relationships do not impact annotations, because annotations are associated with a given GO term regardless of its relationships to other terms within the GO. Obsoletion of terms, however, has an impact on annotations associated with them: in some cases, the old term can be automatically replaced by a new or a parent one; in others, the change is so important that the annotations must be manually reviewed.
However, these changes can affect the analyses done using the ontology. In articles or reports, it is good practice to provide the version of the file used for a particular analysis. In GO, the version number is the date the file was obtained from the GO site (GO files are updated daily).
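As a rough illustration of recording the file version and spotting obsolete terms, the sketch below reads a small OBO-format snippet. It assumes the ontology file carries a `data-version` header tag and marks obsolete terms with `is_obsolete: true`; the version string in the sample and the helper names `parse_obo` and `obsolete_ids` are made up for this example.

```python
def parse_obo(text):
    """Return (header, terms) from OBO-format text.

    header: dict of header tag -> value
    terms:  dict of term id -> dict of tag -> list of values
    """
    header, terms = {}, {}
    stanza = None  # None: still in header; dict: current [Term]; False: other stanza
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            stanza = {} if line == "[Term]" else False
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if stanza is None:
            header[tag] = value
        elif stanza is not False:
            if tag == "id":
                terms[value] = stanza
            stanza.setdefault(tag, []).append(value)
    return header, terms


def obsolete_ids(terms):
    """List the ids of terms flagged as obsolete."""
    return sorted(t for t, tags in terms.items()
                  if tags.get("is_obsolete") == ["true"])


# Illustrative snippet; the data-version value is a placeholder.
SAMPLE = """\
format-version: 1.2
data-version: releases/2015-10-01

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
is_obsolete: true

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function

[Typedef]
id: part_of
name: part of
"""

header, terms = parse_obo(SAMPLE)
print("version to report:", header.get("data-version"))
print("obsolete terms:", obsolete_ids(terms))
```

Reporting the version string (or the download date) next to any analysis makes it possible to trace later discrepancies back to ontology revisions.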
3 Why Use the Gene Ontology?
Because it provides a standardised vocabulary for describing gene and gene product functions and locations, the GO can be used to query a database in search of genes’ function or location within the cell or to search for genes that share characteristics. The hierarchical structure of the GO makes it possible to compare proteins annotated to different terms in the ontology, as long as the terms have relationships to each other. Terms located close together in the ontology graph (i.e. with few intermediate terms between them) tend to be semantically more similar than those further apart (see Chap. 12 on comparing terms).
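The notion of terms being "close together" in the graph can be sketched by counting the edges on the shortest path between two terms. The toy hierarchy and term names below are hypothetical, and edge counting is only a crude proxy; the semantic similarity measures covered in Chap. 12 are more refined.

```python
from collections import deque


def path_length(parents, a, b):
    """Number of edges on the shortest path between terms a and b.

    parents maps each term to a list of its parent terms; edges are
    treated as undirected. Returns None if the terms are not connected.
    """
    adjacency = {}
    for child, plist in parents.items():
        for parent in plist:
            adjacency.setdefault(child, set()).add(parent)
            adjacency.setdefault(parent, set()).add(child)
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # breadth-first search from a
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == b:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy ontology: child -> parents
TOY = {
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

print(path_length(TOY, "A1", "A2"))  # siblings under A
print(path_length(TOY, "A1", "B1"))  # related only through the root
```

In this toy graph the two siblings are separated by one intermediate term, whereas the terms in different branches are separated by three.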
The GO is frequently used to analyse the results of high-throughput experiments. One common use is to infer commonalities in the location or function of genes that are over- or under-expressed [6, 9, 10]. In functional profiling, the GO is used to determine which processes are different between sets of genes. This is done by using a likelihood-ratio test to determine if GO terms are represented differently between the two gene sets.
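One way to picture such a test is a likelihood-ratio (G) test on a 2x2 table of gene counts for a single GO term. The following is a minimal sketch under that assumption, not the procedure of any particular tool; the counts are invented, and a real analysis would also correct for testing many terms.

```python
import math


def g_test_2x2(a, b, c, d):
    """Likelihood-ratio (G) test of independence on a 2x2 table.

    a: genes in set 1 annotated with the term
    b: genes in set 1 without the term
    c: genes in set 2 annotated with the term
    d: genes in set 2 without the term
    Returns (G statistic, p-value from chi-square with 1 degree of freedom).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p


# Invented counts: 30 of 100 genes in set 1 carry the term, 10 of 200 in set 2.
g, p = g_test_2x2(30, 70, 10, 190)
print(f"G = {g:.2f}, p = {p:.2e}")
```

Other tests (for example Fisher's exact test) are also widely used for the same 2x2 layout; the choice of test and of background gene set affects the result.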
Additionally, the GO can be used to infer the function of unannotated genes. Gene predictions with significant similarity to annotated genes can be assigned one or several of the functions of the characterised genes. Other methods such as the presence of specific protein domains can also be used to assign GO terms [11, 12]. This is discussed in Chap. 5.
While Gene Ontology resources facilitate powerful inferences and analyses, researchers using the GO should familiarise themselves with the structure of the ontology and also with the methods and assumptions behind the tools they use to ensure that their results are valid. Common pitfalls and remedies are detailed in Chap. 14.
4 Who Develops the GO and Produces Annotations?
The GO Consortium consists of a number of large databases working together to define standardised ontologies and provide annotations to the GO. The groups that constitute the GO Consortium include UniProt, Mouse Genome Informatics, Saccharomyces Genome Database, WormBase, FlyBase, dictyBase, and TAIR. In addition, several other groups contribute annotations, such as EcoCyc and the Functional Gene Annotation group at University College London. Within each group, biocurators assign annotations according to their expertise. Further, the GO Consortium has mechanisms by which members of the broader community (see Chap. 7) can suggest improvements to the ontology and annotations.
5 What Are the Elements of a GO Annotation?
This section describes the different elements composing an annotation and some important considerations about each of them. The annotation process from a curator standpoint is discussed in detail in Chap. 4.
Fundamentally, a GO annotation is the association of a gene product with a GO term. From its inception, the GO Consortium has recognised the importance of providing supporting information alongside this association. For instance, annotations always include information about the evidence supporting the annotation.
5.1 Annotation Object
The annotation object is the entity associated with a GO term: a gene, a protein, a non-protein-coding RNA, a macromolecular complex, or another gene product. Seven fields of the GAF file specify the annotation object. Each annotation in the GO is associated with a database (field 1) and a database accession number (field 2) that together provide a unique identifier for the gene, the gene product, or the complex. For example, the protein record P00519 is a database object in the UniProtKB database (Fig. 2). The database object symbol (field 3), the database object name (field 10), and the database object synonyms (field 11) provide additional information about the annotation object. The database object type (field 12) specifies whether the object being annotated is a gene or a gene product (e.g. protein or RNA). The organism from which the annotation object is derived is captured as the NCBI taxon ID (taxon; field 13); the corresponding species name can be found at the NCBI taxonomy website.
GO allows capturing isoform-specific data when appropriate; for example UniProtKB accession numbers P00519-1 and P00519-2 are the isoform identifiers for isoform 1 and 2 of P00519. In this case, the database ID still refers to the main isoform, and an isoform accession is included in the GAF file as “Gene Product Form ID” (field 17).
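The field numbering described above can be made concrete with a small parser for one tab-separated GAF line. This is a sketch assuming the 17-column GAF 2.x layout; the Python field names are chosen here for readability, and the sample record is illustrative (including a placeholder reference) rather than copied from a real annotation file.

```python
from collections import namedtuple

# Column order of a GAF 2.x line (fields 1 to 17).
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]
GafRecord = namedtuple("GafRecord", GAF_FIELDS)


def parse_gaf_line(line):
    """Parse one GAF line; return None for comment/header lines."""
    if line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (len(GAF_FIELDS) - len(cols))  # pad missing optional fields
    return GafRecord(*cols[:len(GAF_FIELDS)])


def annotation_object(rec):
    """The seven fields that describe the annotation object."""
    return {
        "db": rec.db,                                # field 1
        "db_object_id": rec.db_object_id,            # field 2
        "db_object_symbol": rec.db_object_symbol,    # field 3
        "db_object_name": rec.db_object_name,        # field 10
        "db_object_synonym": rec.db_object_synonym,  # field 11
        "db_object_type": rec.db_object_type,        # field 12
        "taxon": rec.taxon,                          # field 13
    }


# Illustrative record only; reference, date, and term choice are placeholders.
SAMPLE_LINE = "\t".join([
    "UniProtKB", "P00519", "ABL1", "", "GO:0004713", "PMID:0000000", "IDA",
    "", "F", "Tyrosine-protein kinase ABL1", "ABL1|ABL", "protein",
    "taxon:9606", "20150101", "UniProt", "", "UniProtKB:P00519-1",
])

rec = parse_gaf_line(SAMPLE_LINE)
print(annotation_object(rec))
print("isoform:", rec.gene_product_form_id)
```

Note that the database ID in field 2 stays the same for all isoforms; only field 17 distinguishes them.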
5.2 GO Term, Annotation Extension, and Qualifier
Three fields are used to specify the function of the annotation object. Field 5 specifies the GO term, while field 9 denotes the sub-ontology of GO: Molecular Function, Biological Process, or Cellular Component. While this information is also encoded in the GO hierarchy, explicitly denoting the sub-ontology simplifies parsing of the annotations according to the GO aspect. Field 4 denotes the qualifier. One of three qualifiers can modify the interpretation of an annotation: “contributes_to”, “colocalizes_with”, and “NOT”. This field is not mandatory, but if present it can profoundly change the meaning of an annotation. Thus, while the producers of annotations may omit qualifiers, applications that consume GO annotations must take them into account. The importance of qualifiers is discussed in more detail in Chap. 14.
An additional field, field 16, is a recent addition that makes it possible to combine more than one term or concept (protein, cell type, etc.) in the same annotation. For example, if a gene product Slp1 is localised to the plasma membrane of T-cells, the GAF file field 16 would contain the information “part_of(CL:0000084 T cell)”. Here, CL:0000084 is the identifier for T-cell in the OBO Cell Type (CL) Ontology. This is covered in detail in Chap. 17 on annotation extensions.
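To illustrate why consumers of annotations must honour qualifiers, the sketch below separates negated annotations from the rest before any downstream use. It assumes that multiple qualifiers in field 4 are separated by a pipe character; the gene and term names are hypothetical.

```python
def qualifiers(field4):
    """Split the qualifier field into a set of individual qualifiers."""
    return {q for q in field4.split("|") if q}


def split_by_negation(records):
    """Separate (gene, qualifier, term) records into positive and negated.

    A record carrying the NOT qualifier states that the gene product does
    NOT have the function, so it must not be counted as support for it.
    """
    positive, negated = [], []
    for gene, qualifier, term in records:
        target = negated if "NOT" in qualifiers(qualifier) else positive
        target.append((gene, term))
    return positive, negated


# Hypothetical records: (gene, field 4 qualifier, GO term)
RECORDS = [
    ("geneA", "", "term_X"),
    ("geneB", "NOT", "term_X"),
    ("geneC", "contributes_to", "term_Y"),
    ("geneD", "NOT|colocalizes_with", "term_Z"),
]

positive, negated = split_by_negation(RECORDS)
print("positive:", positive)
print("negated:", negated)
```

A tool that ignored field 4 would wrongly count geneB as having term_X, the opposite of what the annotation states.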
5.3 Evidence Code and Reference Field
5.3.1 Experimentally Supported Annotations
Annotations based on direct experimental evidence found in the primary literature are denoted with the general evidence code EXP (Inferred from Experiment) or, when appropriate, the more specific evidence codes IDA (Inferred from Direct Assay), IPI (Inferred from Physical Interaction), IMP (Inferred from Mutant Phenotype), IGI (Inferred from Genetic Interaction), and IEP (Inferred from Expression Pattern) (Fig. 3). These annotations are held in high regard by the community and are often used in applications such as checking the enrichment of a gene set in particular functions, finding genes that perform a specific function, or assessing involvement in specific pathways or processes.
Another important use of experimentally supported annotations is in providing trustworthy training sets for various computational methods that infer function. Used this way, the experimentally supported annotations can be amplified to understand more of the growing set of newly sequenced genes.
5.3.2 Curated Non-experimental Annotations
Fourteen of the 21 evidence codes are associated with manually curated non-experimental annotations. Annotations associated with these codes are curated in the sense that every annotation is reviewed by a curator, but they are non-experimental in the sense that there is no direct experimental evidence in the primary literature underpinning them; instead, they are inferred by curators based on different kinds of analyses.
ISS (Inferred from Sequence or Structural Similarity) is a superclass (i.e. a parent) of ISA (Inferred from Sequence Alignment), ISO (Inferred from Sequence Orthology), and ISM (Inferred from Sequence Model) evidence codes. Each of the three subcategories of ISS should be used when only one method was used to make the inference. For example, to improve the accuracy of function propagation by sequence similarity, many methods take into account the evolutionary relationships among genes. Most of these methods rely on orthology (ISO evidence code), because the function of orthologs tends to be more conserved across species than that of paralogs [32, 33]. In a typical analysis, characterised and uncharacterised genes are clustered based on sequence similarity measures and phylogenetic relationships. The function of unknown genes is then inferred from the function of characterised genes within the same cluster (e.g. [34, 35]).
Another approach to function prediction entails supervised machine learning based on features derived from protein sequence [36, 37, 38, 39] (ISM evidence code). Such an approach uses a training set of classified sequences to learn features that can be used to infer gene functions. Although few explicit assumptions about the complex relationship between protein sequence and function are required, the results are dependent on the accuracy and completeness of the training data.
IGC (Inferred from Genomic Context) includes, but is not limited to, such things as identity of the genes neighbouring the gene product in question (i.e. synteny), operon structure, and phylogenetic or other whole-genome analysis.
Relatively new are four evidence codes associated with phylogenetic analyses. IBA (Inferred from Biological aspect of Ancestor) and IBD (Inferred from Biological aspect of Descendant) indicate annotations that are propagated along a gene tree. Note that the latter is only applicable to ancestral genes. The loss of an active site, a binding site, or a domain critical for a particular function can be annotated using the IKR (Inferred from Key Residues) evidence code. When this code is assigned by PAINT, GO’s Phylogenetic Annotation and INference Tool, it indicates a prediction based on evolutionary neighbours. Finally, negative annotations can be assigned to highly divergent sequences using the code IRD (Inferred from Rapid Divergence).
RCA (Inferred from Reviewed Computational Analysis) captures annotations derived from predictions based on computational analyses of large-scale experimental data sets, or based on computational analyses that integrate datasets of several types, including experimental data (e.g. expression data, protein-protein interaction data, genetic interaction data), sequence data (e.g. promoter sequence, sequence-based structural predictions), or mathematical models.
Next, there are two types of annotations derived from author statements. TAS (Traceable Author Statement) refers to papers where the result is cited, but not the original evidence itself, such as review papers. On the other hand, NAS (Non-traceable Author Statement) refers to a statement in a database entry or statements in papers that cannot be traced to another paper.
The final two evidence codes for curated non-experimental annotations are IC (Inferred by Curator) and ND (No biological Data available). If an assignment of a GO term is made using the curator’s expert knowledge, concluding from the context of the available data, but without any direct evidence available, the IC evidence code is used. For example, if a eukaryotic protein is annotated with the MF term “DNA ligase activity”, the curator can assign the BP term “DNA ligation” and CC term “nucleus” with the evidence code IC.
The ND evidence code indicates that the function is currently unknown (i.e. that no characterisation of the gene is currently available). Such an annotation is made to the root of the respective ontology to indicate which functional aspect is unknown. Hence, the ND evidence code allows users to distinguish between unannotated genes (for which the literature has not been completely reviewed and thus no GO annotation has been made) and uncharacterised genes (GO annotation with ND code). Note that the ND code is also different from an annotation with the “NOT” qualifier (which indicates the absence of a particular function).
5.3.3 Automatically Assigned Annotations
The evidence code IEA (Inferred from Electronic Annotation) is used for all inferences made without human supervision, regardless of the method used. IEA is by far the most abundantly used evidence code. The guiding idea behind computational function annotation is the notion that genes with similar sequences or structures are likely to be evolutionarily related, and thus, assuming that they largely kept their ancestral function, they might still have similar functional roles today. For an in-depth discussion of computational methods for GO function annotations, refer to Chap. 5 or see refs. [13, 41].
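The three groups of evidence codes described in this section can be summarised as a simple lookup table, which is handy when filtering annotations by the kind of support they have. The grouping below follows the codes as listed in this chapter; the category labels and the function name `categorise` are choices made for this sketch.

```python
# Evidence codes grouped as described in this chapter (21 codes in total).
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "curated_non_experimental": {
        "ISS", "ISA", "ISO", "ISM",  # sequence or structural similarity
        "IGC",                       # genomic context
        "IBA", "IBD", "IKR", "IRD",  # phylogenetic analyses
        "RCA",                       # reviewed computational analysis
        "TAS", "NAS",                # author statements
        "IC", "ND",                  # curator inference, no data
    },
    "automatic": {"IEA"},
}


def categorise(code):
    """Return the category label for an evidence code."""
    for category, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return category
    raise ValueError(f"unknown evidence code: {code}")


for code in ("IDA", "ISO", "IEA"):
    print(code, "->", categorise(code))
```

Such a table makes it easy, for instance, to restrict a training set to experimentally supported annotations, or to report how much of a result rests on IEA annotations.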
5.3.4 Additional Considerations About Evidence Codes
Biases associated with the different evidence codes are discussed in Chap. 14. Note that there is a more extensive Evidence and Conclusion Ontology (ECO), formerly known as the “Evidence Code Ontology”, presented in Chap. 18. ECO is only partially implemented in the GO: ECO terms are displayed in the AmiGO browser, but they are not in the GAF file. However, all Evidence Codes used by the GO are also found in ECO. There is a general assumption among the GO user community that annotations based on experiments are of higher quality compared to those generated electronically, but this has yet to be empirically demonstrated. Generally, annotations derived from automatic methods tend to be made to high-level terms, so they may have a lower information value, but they often withstand scrutiny. Conversely, experiments are sometimes overinterpreted (see Chap. 4) and can also contain inaccuracies.
5.4 Uniqueness of GO Annotations (or Lack Thereof)
No two annotations can have the same combination of the following fields: gene/protein ID, GO term, evidence code, reference, and isoform. Thus one gene can be annotated to the same term with more than one evidence code.
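The uniqueness rule can be expressed as a composite key. The sketch below, using hypothetical identifiers and dictionary keys chosen for this example, shows that the same gene and term with two different evidence codes yield two distinct annotations, which a gene-based analysis would typically collapse into a single gene-term pair.

```python
def annotation_key(annotation):
    """Combination of fields that must be unique across annotations."""
    return (
        annotation["id"],         # gene/protein ID
        annotation["go_id"],      # GO term
        annotation["evidence"],   # evidence code
        annotation["reference"],  # reference
        annotation["isoform"],    # isoform (may be empty)
    )


def find_duplicates(annotations):
    """Return keys that occur more than once (should be empty)."""
    seen, duplicates = set(), []
    for annotation in annotations:
        key = annotation_key(annotation)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def gene_term_pairs(annotations):
    """Collapse annotations to distinct (gene, term) pairs."""
    return {(a["id"], a["go_id"]) for a in annotations}


# Hypothetical annotations: same gene and term, different evidence codes.
TOY = [
    {"id": "GENE1", "go_id": "term_A", "evidence": "IDA",
     "reference": "ref_1", "isoform": ""},
    {"id": "GENE1", "go_id": "term_A", "evidence": "IMP",
     "reference": "ref_2", "isoform": ""},
]

print("duplicates:", find_duplicates(TOY))
print("gene-term pairs:", gene_term_pairs(TOY))
```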
Most GO analyses are gene based, and therefore it is important in such analyses to make sure that the list of genes is non-redundant. However, annotations are often made to larger protein sets that include multiple proteins from the same gene. This is particularly evident in UniProt, which can contain distinct entries from the TrEMBL (unreviewed) portion of the database that do not necessarily represent biologically distinct proteins. The different entries for the same protein or gene are often annotated with identical GO terms, which can bias statistical analyses because some genes have many more entries than other genes. For instance, the set of human proteins in UniProt comprises over 70,000 entries, but there are only approximately 20,000 recognised human protein-coding genes (20,187 reviewed human proteins in the UniProt release of 2015_12). The GO Consortium has worked with UniProt as well as the Quest for Orthologs Consortium to develop “gene-centric” reference proteome lists (http://www.uniprot.org/proteomes/) that provide a single “canonical” UniProt entry for each protein-coding gene. These lists are available for many species, and we encourage users performing gene-centric GO analyses to use only the annotations for UniProt entries in these lists.
6 How Can I Learn More About Gene Ontology Resources?
Most of the topics introduced in this primer will be treated in more depth and nuance in later chapters. Part II focuses on the creation of GO function annotations—we cover in depth the two main strategies of creating GO function annotations: manual extraction/curation from the literature and computational prediction. Part III describes the main strategies used to evaluate their predictive performance. Part IV covers practical uses of the GO annotations: we discuss how GO terms and GO annotations can be summed and compared, how enrichment in specific GO terms can be analysed, and how the GO annotations can be visualised. For the advanced GO user, Part V discusses how the context of a GO annotation is recorded and goes beyond the Evidence Codes to describe how to capture more information on the source of an annotation. We end with Part VI by going beyond GO: we present alternatives to GO for functional annotation; we show how a structured vocabulary is used in the context of controlled clinical terminologies; and we present how information from different structured vocabularies is integrated in one overarching resource.
The authors gratefully acknowledge extensive feedback and ideas from Kimberly Van Auken, Marcus C. Chibucos, Prudence Mutowo, and Paul D. Thomas. PG acknowledges National Institutes of Health/National Human Genome Research Institute grant HG002273. CD acknowledges Swiss National Science Foundation grant 150654 and UK BBSRC grant BB/M015009/1. JH acknowledges National Institutes of Health/National Institute for General Medical Sciences grant U24GM088849. Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
- 3. Hastings J (2016) Primer on ontologies. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 1
- 4. Munoz-Torres M, Carbon S (2016) Get GO! retrieving GO data using AmiGO, QuickGO, API, files, and tools. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 11
- 8. Pesquita C (2016) Semantic similarity in the gene ontology. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 12
- 10. Bauer S (2016) Gene-category analysis. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 13
- 11. Burge S, Kelly E, Lonsdale D, et al. (2012) Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database:bar068
- 13. Cozzetto D, Jones DT (2016) Computational methods for annotation transfers from sequence. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 5
- 14. Gaudet P, Dessimoz C (2016) Gene ontology: pitfalls, biases, and remedies. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 14
- 17. Drabkin HJ, Blake JA, Mouse Genome Informatics Database (2012) Manual gene ontology annotation workflow at the Mouse Genome Informatics Database. Database:bas045
- 19. Davis P, WormBase Consortium (2009) WormBase – nematode biology and genomes. http://precedings.nature.com/documents/3127/version/1/files/npre20093127-1.pdf
- 25. Burge S, Attwood TK, Bateman A, et al. (2012) Biocurators and biocuration: surveying the 21st century challenges. Database:bar059
- 26. Lovering RC (2016) How does the scientific community contribute to gene ontology? In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 7
- 27. Poux S, Gaudet P (2016) Best practices in manual annotation with the gene ontology. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 4
- 28. Huntley RP, Lovering RC (2016) Annotation extensions. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 17
- 31. The Reference Genome Group of the Gene Ontology Consortium (2009) The Gene Ontology’s Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput Biol 5:e1000431
- 42. Chibucos MC, Mungall CJ, Balakrishnan R, et al. (2014) Standardized description of scientific evidence using the Evidence Ontology (ECO). Database:bau075
- 43. Chibucos MC, Siegele DA, Hu JC, Giglio M (2016) The evidence and conclusion ontology (ECO): supporting GO annotations. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 18
This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.