Having discussed common pitfalls associated with the ontology structure, we now turn our attention to annotations. Understanding how annotations are made is essential to interpreting the data correctly. In particular, the information provided for each GO annotation extends beyond the mere association of a term with a protein (see Chap. 3). The full extent of this rich information, aimed at reflecting the biology more precisely within the GO framework, is often overlooked.
3.1 Modification of Annotation Meaning by Qualifiers
The Gene Ontology uses three qualifiers that modify the meaning of the association between a gene product and a Gene Ontology term: “NOT”, “contributes to”, and “co-localizes with” (see documentation at http://geneontology.org/page/go-qualifiers).
The “contributes to” qualifier is used to capture the molecular function of complexes when the activity is distributed over several subunits. In some cases, however, usage of this qualifier is more permissive, and all subunits of a complex are annotated to the same molecular function even if they do not make a direct contribution to that activity. For example, the rat G2/mitotic-specific cyclin-B1 CCNB1 is annotated as contributing to histone kinase activity, although it has only been shown to regulate the kinase activity of CDK1. Finding a cyclin annotated as having protein kinase activity may be unintuitive to users who fail to consider the “contributes to” qualifier.
The “co-localizes with” qualifier is used with two very different meanings: the first is that a protein is transiently or peripherally associated with an organelle or complex; the second covers cases where the resolution of the assay is not sufficient to establish that the gene product is a bona fide component of it. Unfortunately, it is currently not possible to tell which of the two meanings is intended in any given annotation.
3.2 Negative and Contradictory Results
The “NOT” qualifier is the one with the greatest impact, since it means that there is evidence that a gene product does not have a certain function. The “NOT” qualifier is mostly used when a specific function might be expected but has been shown to be missing, either based on closer review of the protein’s primary sequence (e.g., loss of an active-site residue) or because it cannot be experimentally detected using standard assays.
The existence of negative annotations can also lead to apparent contradictions. For instance, the protein ARR2 in Arabidopsis thaliana is associated with “response to ethylene” (GO:0009723) both positively, on the basis of a paper by Hass et al., and negatively, based on a paper by Mason et al. The latter paper discusses this contradiction as follows:
Hass et al.  reported a reduction in the ethylene sensitivity of seedlings containing an arr2 loss-of-function mutation. By contrast, we observed no significant difference from the wild type in the seedling ethylene response when we tested three independent arr2 insertion mutants, including the same mutant examined by Hass et al. . This difference in results could arise from differences in growth conditions, for, unlike Hass et al. , we used a medium containing Murashige and Skoog (MS) salts and inhibitors of ethylene biosynthesis.
Thus, in this case, the contradiction in the GO is a reflection of the primary literature. As Mason et al. note, this is not necessarily reflective of a mistake, as there can be differences in activity across space (tissue, subcellular localisation) and time (due to regulation), with some of these details not fully captured in the experiment or in its representation in the GO.
A NOT annotation may also be assigned to a protein that does not have an activity typical of its homologs, for instance the STRADA pseudokinase (UniProtKB:Q7RTN6); STRADA adopts a closed conformation typical of active protein kinases and binds substrates, promoting a conformational change in the substrate, which is then phosphorylated by a “true” protein kinase, STK11 . In this case, the “NOT” annotation is created to alert the user to the fact that although the sequence suggests that the protein has a certain activity, experimental evidence shows otherwise.
In contrast to positive annotations, “NOT” annotations propagate to child terms in the ontology graph, not to parents. To illustrate, a protein with a negative annotation to “protein kinase activity” is implied not to have the more specific “protein tyrosine kinase activity” either.
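The opposite propagation directions can be made concrete with a minimal sketch. The is_a hierarchy below is a toy with illustrative term names; a real analysis would traverse the full GO graph.

```python
# Toy is_a hierarchy (child -> parents); term names are illustrative.
PARENTS = {
    "protein tyrosine kinase activity": ["protein kinase activity"],
    "protein kinase activity": ["kinase activity"],
    "kinase activity": [],
}

def ancestors(term):
    """All terms reachable by following is_a links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def descendants(term):
    """All terms whose ancestor set contains `term`."""
    return {t for t in PARENTS if term in ancestors(t)}

# A positive annotation implies all ancestors of the annotated term...
implied_positive = {"protein tyrosine kinase activity"} | ancestors(
    "protein tyrosine kinase activity")

# ...whereas a NOT annotation implies NOT for all descendants of the term.
implied_negative = {"protein kinase activity"} | descendants(
    "protein kinase activity")
```

Note that the negative closure stops at the annotated term: in this sketch, “kinase activity” is implied by the positive annotation but is not covered by the NOT annotation.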
3.3 Annotation Extensions
As also described in Chap. 17, the Gene Ontology has recently introduced a mechanism, the “annotation extensions”, by which contextual information can be provided to increase the expressivity of annotations. Previously, an annotation consisted simply of an association between a gene product and a term from one of the three ontologies comprising the GO. With this new knowledge representation model, additional information about the context of a GO term, such as the target gene or the location of a molecular function, may be provided.
A common use is to record the location of the activity or process in which a protein or gene product participates. For example, the role of mouse opsin-4 (MGI:1353425) in the rhodopsin-mediated signaling pathway is biologically relevant in retinal ganglion cells. Annotation extensions also allow dynamic subcellular localization to be captured: the S. pombe bir1 protein (SPCC962.02c), for instance, localizes to the spindle specifically during mitotic anaphase. Annotation extensions can also be used to capture the substrates of enzymes, which used to be outside the scope of the GO.
The annotation extension data is available in the AmiGO  and QuickGO  browsers, as well as in the annotation files compliant with the GAF2.0 format (http://geneontology.org/page/go-annotation-file-gaf-format-20). However, because annotation extensions are relatively new, guidelines are still being developed, and some uses are inconsistent across different databases. Furthermore, most tools have yet to take this information into account.
In effect, extensions of an annotation create a “virtual” GO class that can be composed of more than one “actual” GO class and can be traced up through multiple parent lineages. Thus, just as with inter-ontology links, accounting for annotation extensions can substantially inflate the number of annotations, and this inflation needs to be appropriately accounted for in enrichment analyses and other statistical analyses that require precise specification of the GO term background distribution.
3.4 Biases Associated with Particular Evidence Codes
Annotations are backed by different types of experiments or analyses, categorized according to evidence codes (Chap. 3). Different types of experiments provide varying degrees of precision and confidence with respect to the conclusions that can be derived from them. For most experiment types, it is not possible to provide a quantitative measure of confidence; evidence codes are informative but cannot directly be used to exclude low-confidence data. Nonetheless, the different evidence codes are prone to specific biases.
Direct evidence. Taking these caveats into account, the evidence code “inferred from direct assay” (abbreviated IDA in the annotation files) provides the most reliable evidence with respect to how directly a protein has been implicated in a given function, as its name implies.
Mutant phenotype evidence. Mutants are extremely useful for implicating gene products in pathways and processes; however, exactly how the gene product is involved in the annotated process or function is difficult to assess from phenotypic data, because such data are inherently derivative. Associations between gene products and GO terms based on mutant phenotypes (abbreviated IMP in the annotation files) may therefore be weak. The same caveat applies to annotations derived from mutations in multiple genes, indicated by the evidence code “inferred from genetic interaction” (IGI).
Physical interactions. Evidence based on physical interactions (IPI; mostly protein–protein interactions) is comparable in confidence to a direct assay for protein binding or cellular component annotations; for molecular functions and biological processes, however, the evidence is of the “guilt by association” type and is of low confidence. Inferences based on expression patterns (IEP) are likewise typically of low confidence. The presence of a protein in a specific subcellular localization, at a specific developmental stage, or in association with a protein or protein complex can provide a hint to a protein’s role in the absence of other evidence, but on its own that information is very weak.
High-throughput experiments. Schnoes et al. reported that annotations derived from high-throughput experiments tend to use high-level GO terms and to represent a limited number of functions. Because these terms are so frequently annotated, their information content is artificially decreased, which in turn affects similarity analyses. The impact is potentially large, since a substantial fraction of the annotations in the GO database derive from such analyses (as much as 25 %, according to Schnoes et al., who operationally defined a high-throughput paper as one in which over 100 proteins were annotated). The GO does not currently record whether a particular experimental annotation derives from a high-throughput method, but this may change in the future.
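The effect on information content can be illustrated with a small sketch. Information content is commonly computed as the negative log frequency of a term in the annotation corpus, so a broad term annotated en masse scores low. The corpus below is invented for illustration.

```python
import math
from collections import Counter

# Invented annotation corpus: a broad term annotated en masse by a
# hypothetical high-throughput study, plus one specific term.
annotations = [
    ("g1", "metabolic process"), ("g2", "metabolic process"),
    ("g3", "metabolic process"), ("g4", "metabolic process"),
    ("g5", "histidine biosynthetic process"),
]

counts = Counter(term for _, term in annotations)
total = sum(counts.values())

def information_content(term):
    # IC(t) = -log p(t): the more often a term is annotated,
    # the less information a single annotation to it carries.
    return -math.log(counts[term] / total)
```

Under this measure, flooding the corpus with annotations to a term lowers its information content, which deflates similarity scores computed from it.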
Biases from automatic annotation methods. The GO association file, which contains the annotations, records the method used to assign electronic annotations. These annotations can be produced by a large number of different methods: for example, from domain functions (as assigned by InterPro), from Enzyme Commission numbers associated with an entry, by BLAST, or by orthology assignment. Note that this information is provided not as an evidence code but as a “reference code”; the list of methods and their associated reference codes is available at http://www.geneontology.org/cgi-bin/references.cgi. The sheer number of electronic annotations can also give them a disproportionate impact on results. Most analysis tools allow electronic annotations to be included or excluded, but not at the finer-grained level of the particular method. It is nevertheless possible to use the combination of evidence code and reference to automatically refine the evidence type (see the mapping at https://raw.githubusercontent.com/evidenceontology/evidenceontology/master/gaf-eco-mapping.txt).
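Filtering at the level of the particular method is straightforward to do oneself, since in a GAF 2.0 file the reference (column 6) identifies the method behind each electronic (IEA) annotation. A sketch follows; the GAF rows and gene identifiers are invented examples, with GO_REF:0000002 being the reference code for InterPro-based annotations.

```python
from collections import defaultdict

# Invented GAF 2.0 rows (columns 8-17 elided as "..."); column 6 holds
# the DB:Reference, column 7 the evidence code.
gaf_lines = [
    "UniProtKB\tP00001\tAbc1\t\tGO:0016301\tGO_REF:0000002\tIEA\t...",
    "UniProtKB\tP00001\tAbc1\t\tGO:0005634\tGO_REF:0000003\tIEA\t...",
    "MGI\tMGI:0000001\tXyz1\t\tGO:0007275\tPMID:12345678\tIMP\t...",
]

by_method = defaultdict(list)
for line in gaf_lines:
    if line.startswith("!"):          # GAF comment/header lines
        continue
    cols = line.split("\t")
    reference, evidence = cols[5], cols[6]
    # For electronic annotations, the reference identifies the method.
    key = reference if evidence == "IEA" else "non-electronic"
    by_method[key].append((cols[1], cols[4]))
```

Grouping by reference in this way lets an analysis keep, say, only InterPro-derived electronic annotations while discarding other methods, rather than treating all IEA annotations as a single block.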
Note that a gene or gene product can have multiple annotations to the same term, each backed by different evidence. This can corroborate the function of particular genes, but it may also require appropriate normalization in statistical analyses of term frequency, as the frequency of terms that can be determined through multiple types of experiments may be artificially inflated. Furthermore, because different experiments can vary in their specificity (resulting in annotations at different levels of granularity for essentially the same function), this redundancy only becomes conspicuous when the transitivity of the ontology structure is appropriately taken into account.
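A short sketch makes this concrete: two records for the same gene at different granularities look like two distinct functions until each annotation is propagated up the is_a hierarchy. The terms, links, and records below are hypothetical.

```python
# Toy is_a links (child -> parents).
PARENTS = {
    "protein tyrosine kinase activity": ["protein kinase activity"],
    "protein kinase activity": [],
}

def with_ancestors(term):
    """The term itself plus everything reachable upward via is_a."""
    terms, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in terms:
                terms.add(parent)
                stack.append(parent)
    return terms

# One gene, one underlying function, reported at two granularities
# with two evidence codes (hypothetical records).
raw = [("geneA", "protein kinase activity", "IDA"),
       ("geneA", "protein tyrosine kinase activity", "IMP")]

# Before propagation, the two records look like two distinct functions...
direct_pairs = {(gene, term) for gene, term, _ in raw}

# ...after propagation, both records support the same parent term.
propagated = [(gene, t) for gene, term, _ in raw for t in with_ancestors(term)]
support = propagated.count(("geneA", "protein kinase activity"))
```

A frequency analysis that counts `propagated` pairs without deduplication would count the parent term twice for this gene, even though only one underlying function is being reported.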
For more discussion on evidence codes, and their use in quality control pipelines, refer to Chap. 18 .
3.5 Differences Among Species
There can be substantial differences in the nature and extent of GO annotations across species. For instance, zebrafish is heavily studied in terms of developmental biology and embryogenesis, while the rat is the standard model for toxicology. These differences are reflected in the frequency of GO terms, which can vary considerably from one species to another. This has important implications for enrichment analyses and other statistical analyses requiring a background distribution of GO annotations. Consider, for instance, an experiment trying to establish the biological processes associated with a particular zebrafish protein by identifying its interaction partners and performing an enrichment analysis on them. If we naively use the entire database as background, the interaction partners might appear to be enriched in developmental genes simply because this class is over-represented in zebrafish in general. Instead, one should use only the annotations of zebrafish genes as background.
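The effect of the background choice can be sketched with a standard hypergeometric over-representation test. All the counts below are invented purely for illustration: the same observation that looks highly significant against a whole-database background can be unremarkable against a species-matched one.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Illustrative numbers only: 30 of 100 interaction partners of a
# zebrafish protein are annotated to a developmental term.
k, n = 30, 100

# Background 1: whole multi-species database, where 10 % of 20,000
# genes carry the term; the partner set looks dramatically enriched.
p_whole_db = hypergeom_sf(k, 20_000, 2_000, n)

# Background 2: zebrafish genes only, where 25 % of 25,000 genes carry
# the term; the same observation is far less surprising.
p_zebrafish = hypergeom_sf(k, 25_000, 6_250, n)
```

With these invented frequencies, the whole-database test yields a vanishingly small p-value while the zebrafish-only test does not reach conventional significance, illustrating why the background must match the species under study.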
3.6 Authorship Bias
Other biases are less obvious but can nevertheless be strong, and thus have a high potential to mislead. Sets of annotations derived from the same scientific article were recently shown to be, on average, much more similar than annotations derived from different papers (Fig. 3). For instance, Nehrt et al. compared the functional similarity of orthologs (genes related through speciation) across different species with that of paralogs (genes related through duplication) within the same species, and observed a much higher level of functional conservation among the latter. However, this difference was almost entirely due to the fact that the GO functional annotations of same-species paralogs are ~50 times more likely than those of orthologs to be derived from the same paper; when controlling for authorship and other biases, the difference in functional similarity between same-species paralogs and orthologs vanished, and even reversed in favor of orthologs.
Note that the difference is smaller, but remains significant, if we compare annotations established in different papers that share at least one author with annotations from different papers with no author in common.
3.7 Annotator Bias
Just as systematic differences among investigators can lead to authorship bias, systematic differences in the way GO curators capture information can lead to annotator bias. These annotator biases can be attributed in part to different annotation focus, but also to different interpretation or application of the GO annotation guidelines (http://geneontology.org/page/go-annotation-policies).
UniProt provides annotations for all species, which allows us to assess the effect of annotator (or database) bias. Comparing UniProt annotations for mouse proteins with those made by the Mouse Genome Informatics (MGI) group, we see that comparable fractions of proteins are annotated with the different experimental evidence codes, with mutant phenotypes the most widely used (78 % of experimental annotations in MGI versus 63 % in UniProt), followed by direct assays (20 % of annotations in MGI and 32 % in UniProt).
However, when we look at which GO terms the two groups annotate based on phenotypes (IMP and IGI), we notice a large difference. The top IMP-supported term in MGI is “in utero embryonic development”, with 1170 annotations to 1020 proteins; UniProt has only 4 annotations to this term. Conversely, one of UniProt’s top-annotated classes is “regulation of circadian rhythm”, with 49 annotations to 38 proteins (96 annotations to 69 proteins if annotations to more specific, descendant terms are included), whereas MGI has only 18 annotations for 19 proteins. This indicates that the annotations provided by different groups are biased towards specific aspects of biology, and are not a uniform representation of the biology of all gene products in a species.
3.8 Propagation Bias
Another strong and perhaps surprising bias lies in the very different average GO similarity obtained from electronic annotations compared with experimental annotations. Indeed, if we consider homologous genes, their similarity in terms of electronic annotations tends to be much higher than in terms of experimental annotations, with curated annotations lying in between (Fig. 3). A likely explanation for this phenomenon is that electronic annotations are typically obtained by inferring annotations among homologous sequences, a process that can only increase the average functional similarity of homologs.
Because of this homology inference bias, one must exercise caution when drawing conclusions from sets of genes whose annotations might have different proportions of experimental vs. electronic annotations. For instance, this would be the case when comparing annotations from model organisms with those from non-model organisms (the latter being likely to consist mostly of electronic annotations obtained through propagation).
More subtly, because function conservation is generally believed to correlate with sequence similarity, many computational methods preferentially infer function among phylogenetically close homologs. This bias can thus confound analyses attempting to gauge the conservation of gene function across different levels of species divergence.
3.9 Imbalance Between Positive and Negative Annotations
As discussed above, both our knowledge of gene function and its representation in the GO remain very incomplete. We have already discussed the pitfalls of ignoring this fact altogether (closed vs. open world assumption) and of assuming similar term frequencies across species. But the extent of missing data varies along other dimensions as well: for example, it can depend on how easy a particular function is to establish experimentally, and on how interesting the potential function might be. The problem is particularly acute for negative annotations, because they can be even more difficult to establish than their positive counterparts (a negative result can also be due to inadequate experimental conditions, differences in spatiotemporal regulation, etc.), and because they are often perceived as less useful, and certainly less publishable. As a result, currently less than 1 % of all experimental annotations in UniProt-GOA are negative. This imbalance causes problems for the training of machine learning algorithms. Rider et al. investigated the reliability of typical machine learning evaluation metrics (area under the receiver operating characteristic (ROC) curve, area under the precision–recall curve) under different levels of missing negative annotations, and concluded that this bias can strongly affect the rankings obtained from the different metrics. Although this particular study adopted a closed world assumption, the effect of a varying proportion of negative annotations is likely to be even greater under the open world assumption.