Best Practices in Manual Annotation with the Gene Ontology

The Gene Ontology (GO) is a framework designed to represent biological knowledge about gene products’ biological roles and the cellular location in which they act. Biocuration is a complex process: the body of scientific literature is large and selection of appropriate GO terms can be challenging. Both these issues are compounded by the fact that our understanding of biology is still incomplete; hence it is important to appreciate that GO is inherently an evolving model. In this chapter, we describe how biocurators create GO annotations from experimental findings from research articles. We describe the current best practices for high-quality literature curation and how GO curators succeed in modeling biology using a relatively simple framework. We also highlight a number of difficulties when translating experimental assays into GO annotations.


Background
Biological databases have become an integral part of the tools researchers use on a daily basis for their work. GO is a controlled vocabulary for the description of biological function, and is used to annotate genes in a large number of genome and protein databases. Its computable structure makes it one of the most widely used resources. Manual annotation with GO involves biocurators, who are trained to read, extract, and translate experimental findings from publications into GO terms. Since both the scientific literature and the GO are complex, novice biocurators can make errors or misinterpretations when annotating. Here, we present guidelines and recommendations for best practices in manual annotation, to help curators avoid the most common pitfalls. These recommendations should be useful not only to biocurators, but also to users of the GO, since understanding the curation process helps in understanding the meaning of the annotations.
Our understanding of the world is built by observation and experimentation. The overall process of the scientific method involves making hypotheses, deriving predictions from them, and then carrying out experiments to test the validity of these predictions. The results of the experiments are then used to infer whether the prediction was true or not [1]. Hypotheses are tested, validated, or rejected, and the combination of all the experiments contributes to uncovering the mechanism underlying the process being studied (Fig. 1).
Examples of experiments include testing an enzymatic activity in vitro using purified reagents, measuring the expression level of a protein upon a given stimulus, or observing the phenotypes of an organism in which a gene has been deleted by molecular genetics techniques. Different inferences can be made from the same experimental setup depending on the hypothesis being tested. Thus, the conclusions that can be derived from individual experiments may vary, depending on a number of factors: they depend on the current state of knowledge, on how well controlled the experiment is, on the experimental conditions, etc. It also happens that the conclusions from a low-resolution experiment are partially or completely refuted when better techniques become available. These factors are inherent to empirical studies and must be taken into account to ensure correct interpretation of experimental results.
Fig. 1 How the scientific method is used to test and validate hypotheses
GO is a framework to describe the roles of gene products across all living organisms [2] (see also Chap. 2, [3]). The ontology is divided into three branches, or aspects: Molecular Function (MF) that captures the biochemical or molecular activity of the gene product; Biological Process (BP), corresponding to the wider biological module in which the gene product's MF acts; and Cellular Component (CC), which is the specific cellular localization in which the gene product is active.

Knowledge Inference: General Principles
The association of a GO term and a gene product is not explicitly defined, but implicitly means that the gene product has an activity or a molecular role (MF term), directly participates in a process (BP), and the function takes place in a specific cellular localization (CC) [2]. Therefore, transient localizations such as endoplasmic reticulum and Golgi apparatus for secreted proteins are not in the scope of GO. Biological process is the most challenging aspect of the GO to capture, in part because it models two categories of processes: subtypes, e.g., "mitotic DNA replication" (GO:1902969) is a particular type of "nuclear DNA replication" (GO:0033260), and sub-processes, e.g., mitotic DNA replication is a step of the "mitotic cell cycle" (GO:0000278). These two classification axes are distinguished by "is a" and "part of" relations with their parents, respectively. A gene product can be annotated using as many GO terms as necessary to completely describe its function, and the GO terms can be at varying levels in the hierarchy, depending on the evidence available. If a gene product is annotated to any particular term, then the annotations also hold for all the is-a and part-of parent terms. Annotations to more granular terms carry more information; however, the annotation cannot be any deeper than what is supported by the evidence.
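The propagation of annotations to parent terms can be illustrated with a minimal sketch. It assumes a hand-written toy graph containing only the three terms cited above; the `PARENTS` mapping and the function names are hypothetical and are not part of any GO software.

```python
# Toy fragment of the ontology: child term -> list of (relation, parent term).
# Only the three identifiers cited in the text are used; the graph is
# deliberately incomplete and the data layout is an assumption.
PARENTS = {
    "GO:1902969": [                 # mitotic DNA replication
        ("is_a", "GO:0033260"),     # subtype of nuclear DNA replication
        ("part_of", "GO:0000278"),  # step of the cell cycle term cited above
    ],
    "GO:0033260": [],
    "GO:0000278": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` through is_a or part_of links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_terms(direct_terms):
    """Expand direct annotations with all their is_a/part_of ancestors."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term)
    return implied


# A gene product annotated directly to mitotic DNA replication is implicitly
# annotated to both parent terms as well.
print(sorted(implied_terms({"GO:1902969"})))
```

The converse does not hold: an annotation to a parent term says nothing about its more granular children, which is why the depth of an annotation is limited by the evidence.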
The complexity of biology is reflected in the GO: with 40,000 different terms [4], learning to use the GO can be compared to learning a new language. As when learning a language, there are terms that are closely related to those we are familiar with, and others that have subtle but important differences in meaning. The GO defines each term in two complementary ways. First, each term has a textual definition intended to be human readable. Second, the structure of the ontology, as determined by the relationships between terms, also defines each term; these relationships can be utilized for computational reasoning.

Methods for Assigning GO Annotations
There are two general methods for assigning GO terms to gene products. The first is based on experimental evidence, and involves detailed reading of scientific publications to capture knowledge about gene products. Biocurators browse the GO ontologies to associate appropriate GO term(s) whose definition is consistent with the data published for the gene product. See Chaps. 3 [5] and 17 [6], for a description of the elements of an annotation. Expert curation based on experiments is considered the gold standard of functional annotation. It is the most reliable and provides strong support for the association of a GO term with a gene product.
The second method involves making predictions on the protein's function and subcellular localization, most often with methods relying on sequence similarity. Although not detailed in this chapter, prediction methods are highly dependent on annotations based on experiments. Indeed, all methods to assign annotations based on sequence similarity are more or less directly derived from knowledge that has been acquired experimentally; that is, at least one related protein must have been tested and shown to have a given function for that information to be propagated to other proteins. Hence, the accurate assignment of GO classes to gene products based on experimental results is crucial, since many further annotations depend on their accuracy.

Best Practices for High-Quality Manual Curation
Similar to the process by which experimental results get translated into a model of the biological phenomenon being investigated, biocurators take the conclusions from the investigation and convert them into the GO framework. Thus, the same assay may lead to different interpretations depending on the question being tested.
As shown in Table 1, an assay must be interpreted in the wider context of the known roles of the protein, and how directly the assay assesses the protein's role in the process under investigation. Here, several experiments are described in which the readout is DNA fragmentation upon apoptotic stimulation, but that lead to different annotations. DFFB (UniProtKB O76075) is annotated to "apoptotic DNA fragmentation" (GO:0006309) because the protein is also known to be a nuclease. CYCS (UniProtKB P99999) is annotated to caspase activation ("activation of cysteine-type endopeptidase activity involved in apoptotic process" (GO:0006919)) because a direct role has been shown using an in vitro assay. However, CYCS is not annotated to "apoptotic DNA fragmentation" (GO:0006309) despite the observation that removing it from cells prevents DNA fragmentation, since the activity of CYCS occurs before DNA fragmentation. Any step that takes place afterwards will inevitably fail to happen, but this does not imply participation in this downstream sequence of molecular events. Finally, the FOXL2 (UniProtKB P58012) transcription factor has a positive effect on the occurrence of apoptosis, by an unknown mechanism, so it is annotated to "positive regulation of apoptotic process" (GO:0043065). This is where the curator's knowledge is critical and provides most added value over, e.g., machine learning and text mining.

GO Inference Process
With more than 500,000 records indexed yearly in PubMed, it is not possible for the GO to comprehensively represent all the available data on every protein. To address this, a careful prioritization of both articles and proteins to annotate is done. The publications from which information is drawn are selected to accurately represent the current state of knowledge. Accessory findings and nonreplicated data are not systematically annotated; confirmation or at least consistency with findings from several publications is invaluable to accurately describe the function of a gene product.
Focusing on a topic allows the curator to construct a clear picture of the protein's role and makes it easier to make the best decisions when capturing biological knowledge as annotations. Reading different publications in the field helps to resolve issues and select terms with more confidence. Existing GO annotations for proteins that participate in the same biological process are also helpful when deciding how best to represent the experimental data with the GO. On the other hand, without the broader context of the research domain, some papers may be misleading: first, as more data accumulate, a growing number of contradictory or even incorrect results are found in the scientific literature. Second, the way knowledge evolves occasionally renders previous findings obsolete. Curators use their expertise to assess the scientific content of articles and avoid these pitfalls [10].

Needles and Haystacks
The level of granularity of an annotation is dictated by the evidence supporting it. A good illustration is provided by the human ADCK3 protein (UniProtKB Q8NI60), an atypical kinase containing a protein kinase domain that is involved in the biosynthesis of ubiquinone, an essential lipid-soluble electron transporter. Although it contains a protein kinase domain, it is unclear whether it acts as a protein kinase that phosphorylates other proteins in the CoQ complex or acts as a lipid kinase that phosphorylates a prenyl lipid in the ubiquinone biosynthesis pathway [11]. While it would be tempting to conclude that the protein has "protein kinase activity" (GO:0004672) from the presence of the protein kinase domain, the more general term "kinase activity" (GO:0016301), with no specification of the potential substrate class (lipid or protein), is more appropriate.
Annotations focus on capturing experiments that are biologically relevant. Thus, substrates, tissue, or cell-type specificity are annotated only when the data indicate the physiological importance of these parameters. One difficulty is that it is not always possible to distinguish between experimental context and biological context, which can potentially result in GO terms being assigned as if they represented a specific role or specific conditions, while in fact they only reflect the experimental setup and do not have real biological significance. For example, the activity of E3 ubiquitin protein ligases is commonly tested by an in vitro autoubiquitination assay. While convenient, the assay is not conclusive with respect to "protein autoubiquitination" (GO:0051865) in vivo. In the absence of additional data, only the term "ubiquitin protein ligase activity" (GO:0061630) should be used. Similarly, the cell type in which a function was tested does not imply that the cell type is relevant for the function; any hint that the protein is studied outside its normal physiological context (such as overexpression) should be carefully taken into consideration.
Downstream effects, as well as readouts (discussed above in Subheading 2.1), can lead to incorrect annotations if they are directly assigned to a gene product whose role is many steps removed. Here we use downstream as "occurring after," with no implication on the direct sequentiality of the events. Gene products that play housekeeping functions or function upstream of important signaling pathways have many indirect effects and pose a challenge for annotation. This can be illustrated by proteins that mediate chromatin modification. Histone tails are posttranslationally modified by a complex set of interdependent modifications. For instance, histone H2B monoubiquitination at Lys-120 (H2BK120ub) is a prerequisite for the methylation of histone H3 at Lys-4 and Lys-79 (H3K4me and H3K79me, respectively) (Fig. 2). RNF20 (UniProtKB Q5VTR2), an E3 ubiquitin ligase that mediates H2BK120ub, therefore indirectly promotes H3K4 and H3K79 methylation [12]. Thus, the annotation of enzymes that modify histone tails is limited to the primary function of the enzyme ("ubiquitin-protein ligase activity" (GO:0004842) and "histone H2B ubiquitination" (GO:0033523), in this case), while the further histone modifications are only annotated to the proteins mediating these modifications.

How Low Can You Go?
A similar approach is taken for cases where the experimental readout is also a GO term. Examples of this include DNA fragmentation assays to measure apoptosis, and MAPK cascade to measure the activation of an upstream pathway. Proteins that are involved in signaling leading to apoptosis do not mediate or participate in DNA fragmentation, but their addition or removal causes changes in the amount of DNA fragmentation upon apoptotic stimulation. In other words, the effect of a protein on a specific readout can be very indirect. Whenever possible, annotation of these very specific terms ("apoptotic DNA fragmentation" (GO:0006309), "MAPK cascade" (GO:0000165)) is limited to cases where there is evidence of a molecular function supporting a direct implication in the process. If that information is not available, the annotation is made to a more general term, such as "apoptotic process" (GO:0006915) or "intracellular signal transduction" (GO:0035556), for instance.
One common method to determine the function or process of a gene is mutagenesis. However, interpreting the results from mutant phenotypes is very difficult, as the effects caused by the absence or disruption of a gene can be very indirect. Any kind of knockout/knockdown or "add back" experiments (in which proteins are either overexpressed or added to a cellular extract) cannot demonstrate the participation of a protein in a process, only its requirement for the process to occur. Inferring a participatory role would be an over-interpretation of the results. A striking illustration of this can be made with housekeeping genes, such as those involved in transcription and translation: knockouts of these genes (when not lethal) can be pleiotropic and affect essentially all cellular processes. It would be both inaccurate and overwhelming for curators to annotate these gene products to every cellular process impacted. The more prior knowledge we have about a protein's function, in particular its biochemical activity, the more accurate we can be when interpreting a phenotype. Phenotypes caused by gene mutations are of great interest, not only to try to understand the function of proteins, but also to provide insights into mechanisms leading to disease. The scope of the GO, though, is to capture the normal function of proteins. There are phenotype ontologies for human (HPO [13]), mouse (MP [14]), and other species that allow phenotypes to be captured in a structure that is more relevant to this type of data.

Phenotypes
One limitation of the GO is that main functions and secondary roles are not explicitly encoded, so that this information is difficult to find. For example, enzymes may have different substrates: in some cases, the substrate specificity is driven by the biological context, but in other cases by the experimental conditions. While some activities represent the main function of the enzyme, others are secondary or can be limited to very specific conditions.
A good example is provided by the CYP4F2 enzyme (UniProtKB Q9UIU8), a member of the cytochrome P450 family that oxidizes a variety of structurally unrelated compounds, including steroids, fatty acids, and xenobiotics. In vivo, the enzyme plays a key role in vitamin K catabolism by mediating omega-hydroxylation of vitamin K1 (phylloquinone), and menaquinone-4 (MK-4), a form of vitamin K2 [15, 16]. While hydroxylation of phylloquinone and MK-4 probably constitutes the main activity of this enzyme since this activity has been confirmed by several in vivo assays, CYP4F2 also shows activity towards other related substrates, such as arachidonic acid omega and leukotriene-B [10] omega [17-21]. Clearly vitamin K1 and MK-4 are the main physiological substrates of CYP4F2, but since it is plausible that the enzyme also acts on other molecules, these different activities are also annotated. In the absence of additional evidence, it is currently impossible to highlight which GO term describes the in vivo function of the enzyme. For the reactions known to be implicated in vitamin K catabolism, adding this information as an annotation extension helps clarify the main role of that specific reaction (see Chap. 17, [6]).

Main Functions and Secondary Roles
Our understanding of biology is dynamic, and evolves as new experiments confirm or contradict previous results. It is therefore essential to read several, preferably recent, publications on a subject to make sure that prior working hypotheses that have subsequently been invalidated are not annotated. That is, sometimes it is necessary to remove annotations in order to limit the number of false positives. A number of mechanisms exist in GO to capture evolution of knowledge. New GO terms are added to the ontology when knowledge is not covered by existing GO terms. Curators work in collaboration with the GO editors, defining new terms or correcting the definitions of existing terms when required. Conflicting results can be dealt with by using the "NOT" qualifier, which states that a gene product is not associated with a GO term. This qualifier is used when a positive association to this term could otherwise be expected from previous literature or automated methods (for more information read www.geneontology.org/GO.annotation.conventions.shtml#not).
A good example of how GO deals with evolving knowledge as new papers are published on a protein is provided by the recent characterization of the NOTUM protein in human and Drosophila melanogaster. Notum was first characterized in D. melanogaster (UniProtKB Q9VUX3) as an inhibitor of Wnt signaling [22, 23]. Based on its sequence similarity with pectin acetylesterase family members, it was initially thought to hydrolyze glycosaminoglycan (GAG) chains of glypicans by mediating cleavage of their GPI anchor in vitro [24]. Two different articles published recently contradict these previous results, showing that the substrate of human NOTUM (UniProtKB Q6P988) and D. melanogaster Notum is not glypicans, and that human NOTUM specifically removes a palmitoleic acid modification from WNT proteins [25, 26]. These new data confirm the role of NOTUM as an inhibitor of Wnt signaling, but with a mechanism completely different from what the initial studies had suggested. To correctly capture these findings in GO, new terms describing protein depalmitoleylation were added in GO: "palmitoleyl hydrolase activity" (GO:1990699) and "protein depalmitoleylation" (GO:1990697). In addition, NOTUM proteins received negative annotations for "GPI anchor release" (GO:0006507) and "phospholipase activity" (GO:0004620) to indicate that these findings had been disproven.
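The effect of the "NOT" qualifier on downstream use can be sketched as follows. This is an illustration only: the `Annotation` record is a hypothetical, simplified structure (it is not the GO annotation file format), and the four records simply restate the NOTUM annotations described above for the human protein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record (not a real GO file format)."""
    gene_product: str      # UniProtKB accession
    go_id: str
    negated: bool = False  # True when the annotation carries the NOT qualifier


annotations = [
    Annotation("Q6P988", "GO:1990699"),                # palmitoleyl hydrolase activity
    Annotation("Q6P988", "GO:1990697"),                # protein depalmitoleylation
    Annotation("Q6P988", "GO:0006507", negated=True),  # NOT GPI anchor release
    Annotation("Q6P988", "GO:0004620", negated=True),  # NOT phospholipase activity
]


def positive_terms(records, gene_product):
    """GO terms positively associated with the gene product."""
    return {r.go_id for r in records
            if r.gene_product == gene_product and not r.negated}


def refuted_terms(records, gene_product):
    """GO terms explicitly stated NOT to apply to the gene product."""
    return {r.go_id for r in records
            if r.gene_product == gene_product and r.negated}


print(sorted(positive_terms(annotations, "Q6P988")))
print(sorted(refuted_terms(annotations, "Q6P988")))
```

A consumer that ignores the qualifier would wrongly count the two refuted terms as functions of NOTUM, so negated records need to be separated out before any analysis.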
Although relatively infrequent, this type of situation is critical because it may affect the accuracy of the GO. Ideally, when new findings invalidate previous ones, old annotations are revisited in the light of new knowledge and annotation from previous papers reevaluated to ensure that annotation was not the result of overinterpretation of data.

Hindsight Is 20/20: Dealing with Evolving Knowledge
The most widely used manual protein annotation editor for GO, Protein2GO, has a mechanism to dispute questionable or outdated annotations that sends a request for reevaluation of annotations [27]. Users who notice incorrect or missing annotations are strongly encouraged to notify the GO helpdesk (http://geneontology.org/form/contact-go) so that corrections can be made.

Importance of Annotation Consistency: Toward a Quality Control Approach
The goal of the GO project is to provide a uniform schema to describe biological processes mediated by gene products in all cellular organisms [ 2 ]. Annotation involves translating conclusions from biological experiments into this schema, such that we are making inferences of inferences. To avoid deriving too much from the biologically relevant conclusions of experiments, consistent annotation within the GO framework is essential.
The GO curators make every effort to ensure that annotations reflect the current state of knowledge. As new findings are made that invalidate or refine existing models, there is a need for course correction; otherwise both the ontology and the annotations may drift.
Over 20 groups contribute manual annotations to the GO project (http://geneontology.org/page/download-annotations). The number of annotations by species, broken down into experimental versus non-experimental, is shown in Fig. 3. Since manual annotations are so critical to the overall quality of the entire corpus of GO data, it is important that each biocurator from every contributing group interprets experiments consistently.
• An automated checker runs through the GO annotation rulebase (http://geneontology.org/page/annotation-qualitycontrol-checks), which validates the syntactic and biological content of the annotation database, and verifies that correct procedures are followed. Examples include taxon checks [28] and checks to ensure that the correct object type is used with different types of evidence.
• The annotation team of the GO consortium also has regular annotation consistency exercises, where participating annotators independently annotate the same paper to ensure that guidelines are applied in a uniform manner, discuss any discrepancy, and update guidelines when these are lacking or need clarification.
• Finally, the Reference Genome Project [29] has proven to be a very useful resource to improve annotation coherence across the GO (Feuermann et al., in preparation). The project uses PAINT, a Phylogenetic Annotation and INference Tool, to annotate protein families from the PantherDB resource [30]. PAINT integrates phylogenetic trees, multiple sequence alignments, experimental GO annotations, as well as references pointing to the original data. PAINT curators select the high-confidence data that can be propagated across either the entire tree or specific clades. By displaying different GO annotations for all members of a family, PAINT makes it easy to detect inconsistencies, thus improving the overall quality of the set of GO annotations. It also provides a means of identifying consistent biases that usually indicate a problem in the ontology or in the annotation guidelines.
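The idea of detecting inconsistencies by viewing the annotations of all family members side by side can be sketched in a few lines. This is not PAINT's algorithm: it ignores the phylogenetic tree entirely, and the family members and their annotation sets below are invented placeholders used only to show the comparison.

```python
from collections import defaultdict


def terms_missing_in_members(family_annotations):
    """Map each GO term to the family members that lack it.

    `family_annotations` maps a member name to its set of GO term IDs.
    Terms shared by every member are omitted from the result.
    """
    all_members = set(family_annotations)
    members_by_term = defaultdict(set)
    for member, terms in family_annotations.items():
        for term in terms:
            members_by_term[term].add(member)
    return {
        term: sorted(all_members - members)
        for term, members in members_by_term.items()
        if members != all_members
    }


# Hypothetical family: placeholder member names and annotation sets.
family = {
    "protein_A": {"GO:0016301", "GO:0006915"},
    "protein_B": {"GO:0016301"},
    "protein_C": {"GO:0016301", "GO:0006915"},
}

report = terms_missing_in_members(family)
print(report)
```

A term flagged for a single member is only a prompt for a curator to look again: the difference may reflect missing curation, an over-interpreted experiment, or a genuine functional divergence within the family.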

Summary
Expert curation of GO terms based on experimental data is a complex process that requires a number of skills from biocurators. In this chapter, we describe a number of guidelines to alert curators to common annotation mistakes and provide clues on how to avoid them. These simple rules, summarized in Table 2, can be used as a checklist to ensure that GO annotations are in line with GO consortium guidelines.

Perspective
The guidelines presented here are easy to follow and reinforce curation quality without reducing curation efficiency, which is a serious and valid challenge in the era of big data. In view of the amount of data to be dealt with, it has often been argued that manual curation "just doesn't scale," and an ongoing search for alternative methods is under way in the world of biocuration and bioinformatics.
However, examples described in this chapter show that most publications describe complex knowledge that cannot be captured by machine learning or text mining technologies. To continue having an acceptable throughput, manual curation should be able to cope with the increasing corpus of scientific data. From this perspective, PAINT constitutes an excellent example of a propagation tool based on experimental GO annotations, which ensures maximum consistency and efficiency without compromising the quality of the annotations produced. Such a system provides one possible answer to the concerns raised about the scalability of expert curation.
Funding
Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
Table 2 Summary of annotation guidelines
• Carefully select publications. Only annotate papers that provide the most added value.
• Read recent publications. Research is not a straightforward process, and reading recent publications helps resolve conflicts and detect experimental discrepancies.
• Check annotation consistency. Review the existing annotations for related proteins to see whether the annotations you are adding are consistent.
• Look for confirmation of unusual findings in multiple papers, if possible. Avoid entering annotations based on experiments that do not directly implicate the protein with the GO term you annotate.
• Annotate the conclusion of the experiment. Keep in mind that this may be different from the results presented. Be especially careful of interpreting the function of proteins based on mutant phenotypes.
• Remove obsolete annotations. If you encounter an annotation that is based on an interpretation of an experiment that is no longer valid, use the Challenge mechanism or GO helpdesk to ask to have the annotation removed.

Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated. The images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.