Complementary Sources of Protein Functional Information: The Far Side of GO

The GO captures many aspects of functional annotation, but there are other, complementary sources of protein function information. For example, enzyme functional annotations are described in a range of resources, from the Enzyme Commission (E.C.) hierarchical classification to the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Catalytic Site Atlas, amongst many others. This chapter describes some of the main resources available and how they can be used in conjunction with GO.


Introduction
The Gene Ontology (GO) offers experimental and computational biology researchers an accessible range of controlled vocabulary annotations to describe protein function. This allows detailed as well as large-scale analyses to be conducted. There is, however, a range of other sources of functional annotation, which in combination with GO provide enhanced functional descriptions. Examples of such complementary resources include the Enzyme Commission's classification of enzyme reactions [1], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [2], BRENDA [3], CSA [4], MACiE [5] and the MetaCyc database of enzymes and pathways [6], amongst many others. Most of these resources include GO terms within their own annotations, or their definitions are included within the Gene Ontology. Mapping terms between resources offers enhanced descriptions, and relationships between them, that are not readily captured solely within GO. The Gene Ontology provides many of these mappings through its website (http://geneontology.org/page/download-mappings); they are updated automatically, at intervals that depend on how often the corresponding resource is updated. This chapter describes some of these complementary resources, focusing mainly on enzymes.

Annotating Enzymes
With over 100 years of experimental biochemical data available, enzymes are one of the richest areas for complementary functional annotation. Historically, naming conventions for enzymes have been confused and haphazard, with several names being given to one enzyme and one name being given to several enzymes. Often the names convey little information about the reaction the enzyme catalyzes. This led to the development of the Enzyme Classification (E.C.) system by the International Commission on Enzymes, founded in 1956 by the International Union of Biochemistry [1]. The E.C. number is a hierarchical system consisting of four levels. The first level has six divisions giving a broad description of the overall chemical transformation (enzyme class): Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases and Ligases. The next two levels (subclass and sub-subclass) generally describe the reactive species and the type of bond being acted upon. The meaning of these numbers is class dependent. The final level is a serial number for the overall reaction within that sub-subclass. The overall reactions described are mass-balanced as far as possible, though they are not necessarily charge-balanced. Nor are they meant to represent the equilibrium position or reaction direction: by convention, all reactions within a given sub-subclass are written in the same direction, even if their physiological direction is different. General reactions, where the enzyme has broad specificity, are given as single generic reactions, and alternative reactions with specific metabolites are also given. Some reactions are incomplete, while others are combinations of successive reactions [7]. Thus it is possible for one E.C. number to have multiple reactions associated with it, and for many reactions to be assigned to the same E.C. number (see Fig. 1a).
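The four-level structure described above can be illustrated with a minimal Python sketch. It assumes the six classes listed in the text and the common convention of writing a dash for an unspecified level; the names `ECNumber`, `parse_ec` and `shared_levels` are hypothetical and do not belong to any existing library.

```python
from typing import NamedTuple, Optional

# Top-level enzyme classes as listed in the text (first level of the E.C. number).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


class ECNumber(NamedTuple):
    enzyme_class: int
    subclass: Optional[int]
    sub_subclass: Optional[int]
    serial: Optional[int]

    @property
    def class_name(self) -> str:
        return EC_CLASSES[self.enzyme_class]


def parse_ec(text: str) -> ECNumber:
    """Parse 'a.b.c.d'; a '-' marks an unspecified level (assumed convention)."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {text!r}")
    levels = [None if part == "-" else int(part) for part in parts]
    if levels[0] not in EC_CLASSES:
        raise ValueError(f"unknown enzyme class in {text!r}")
    return ECNumber(*levels)


def shared_levels(a: ECNumber, b: ECNumber) -> int:
    """Count the leading levels on which two E.C. numbers agree."""
    count = 0
    for x, y in zip(a, b):
        if x is None or x != y:
            break
        count += 1
    return count
```

For instance, `parse_ec("2.1.2.9")` and `parse_ec("2.1.2.11")` are both Transferases and agree on their first three levels, differing only in the serial number. Counting shared levels in this way is only a coarse indication of relatedness, since the meaning of the second and third levels is class dependent.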
Currently there are 6510 E.C. numbers approved, with 5560 of them in active use. Of these active E.C. numbers, only 3924 (approximately 70 %) have an equivalent GO term. A full list of E.C. to GO cross-references can be found on the GO website (http://geneontology.org/external2go/ec2go). There are a number of reasons why a mapping between E.C. and GO cannot be made. The most likely is that GO does not yet have a term that covers the E.C. term, e.g. E.C. 1.1.1.287 (D-arabinitol dehydrogenase). An automatic pipeline updates the cross-reference file after each GO release with any new terms that are created. Other reasons why E.C. and GO terms cannot be mapped are that an E.C. entry has been transferred from one term to another, or that the E.C. number has yet to be associated with a gene product (termed orphaned E.C. terms).
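As a rough illustration of how such a cross-reference file might be read, the following Python sketch parses lines of the form `EC:<number> > GO:<term name> ; GO:<id>`, with comment lines starting with `!`. This layout is an assumption about the external2go format and should be checked against the downloaded file; the two sample lines are illustrative only, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line layout: "EC:1.1.1.1 > GO:<term name> ; GO:0004022"
LINE_RE = re.compile(
    r"^EC:(?P<ec>\S+)\s+>\s+GO:(?P<name>.+?)\s+;\s+(?P<go_id>GO:\d{7})\s*$"
)

# Illustrative sample; in practice, read the lines of the downloaded file.
SAMPLE = """\
! comment lines start with an exclamation mark
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
EC:1.1.1.2 > GO:alcohol dehydrogenase (NADP+) activity ; GO:0008106
""".splitlines()


def parse_ec2go(lines):
    """Return {ec_number: [(go_id, go_name), ...]} for all parsable lines."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and header comments
        match = LINE_RE.match(line)
        if match:
            mapping[match["ec"]].append((match["go_id"], match["name"]))
    return dict(mapping)


def coverage(mapped: int, active: int) -> float:
    """Fraction of active E.C. numbers that have a GO equivalent."""
    return mapped / active


ec2go = parse_ec2go(SAMPLE)
print(ec2go.get("1.1.1.1"))
print(f"{100 * coverage(3924, 5560):.1f} % of active E.C. numbers mapped")
```

An E.C. number that is absent from the resulting dictionary, such as 1.1.1.287 in the example from the text, is one for which no GO equivalent is listed.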
Additionally, there are "pseudo" E.C. terms created by UniProt that describe an overall reaction derived from the literature but have yet to be included in the E.C. classification. These are easily identifiable as they have a letter n in the fourth level of the hierarchy, e.g. 1.1.1.n5 (3-methylmalate dehydrogenase).
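Because these provisional numbers are distinguished only by the letter n in the fourth level, they can be separated from approved numbers with a simple pattern match. The sketch below assumes the form shown in the example above (three numeric levels followed by `n` and a number); the function name is hypothetical.

```python
import re

# Three numeric levels, then 'n' followed by digits in the fourth level.
PRELIMINARY_RE = re.compile(r"^\d+\.\d+\.\d+\.n\d+$")


def is_preliminary_ec(ec: str) -> bool:
    """True for provisional numbers such as '1.1.1.n5'."""
    return PRELIMINARY_RE.match(ec.strip()) is not None
```

Filtering with such a check before looking up E.C. to GO cross-references avoids treating provisional numbers as unmapped approved ones.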
Databases such as KEGG and BRENDA hold details of alternative reactions and data relating to physiological function. Other resources hold more specific functional annotations, such as the catalytic residues and how they function in the overall reaction, as cataloged by the Catalytic Site Atlas (CSA), or the steps in an enzyme's reaction, as annotated by MACiE: the order in which bonds are broken and formed, the role of cofactors and the function of protein residues at each step. To bridge the gap between these more chemical descriptors and the biological descriptors associated with a protein, a new ontology, the Enzyme Mechanism Ontology (EMO), has been developed [4]. Though not directly linked to GO, EMO terms can be determined through links with the GOA terms of the UniProtKB record for a particular enzyme.

Comparing Enzyme Annotations
Unlike GO, the E.C. number cannot be used to make automated quantitative comparisons between annotations. There are a number of measures of annotation similarity that can be made based on the GO ontological graph. The most basic similarity measure is based on the length of the path that two terms share to the ontology root. Because the depth of a term within the ontology is not necessarily indicative of its specificity, this has been enhanced with a measure of specificity termed information content (IC). Further enhancements normalize the IC measure (Lin score) and use semantic similarity (Wang score) [8, 9]. To overcome the deficiencies of E.C. as a means to measure functional similarity, and to capture detailed reaction information not encapsulated in GO, new methods have been developed. Efforts to compare reactions based on their overall reaction chemistry have met with only moderate success, limited by their reliance upon the consistency and reliability of the underlying reaction data and by the ability of the algorithm used to process a diverse range of reactions. The latest method, called EC-Blast [10], has proven more successful. It uses an atom-atom mapping approach to automatically assign bond changes and reaction centers (the atom and bond types in the immediate region of the metabolite where the bonds are broken or formed). This allows the reaction to be described by a set of fingerprints that, in composite, can be used to compare reactions. Taking all available E.C. numbers and equivalent GO terms that can be compared to each other, the difference between the two ways of measuring functional similarity is shown in Fig. 2. Though many comparisons result in similar scores, a substantial number diverge significantly. For example, when E.C. 2.1.2.9 is compared to E.C. 2.1.2.11 on the basis of bond order changes, the similarity score calculated by EC-Blast is 0.22, whereas the semantic similarity between the equivalent GO terms is 0.73. The low similarity from EC-Blast encapsulates the differences in bonds cleaved (two C-N bonds and two H-N bonds for E.C. 2.1.2.9, compared to one C-C, one H-O and one C-H bond for E.C. 2.1.2.11) as well as differences in stereochemistry changes and bond order rearrangements. Thus, care needs to be taken in choosing the best measure of functional similarity, a widely used technique in functional inference (see Chap. 12 [26]).
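The IC-based measures mentioned above can be made concrete with a small Python sketch. It computes information content from annotation counts and, from that, the similarity of the most informative common ancestor (often called the Resnik score) and its normalized form (Lin score). The ontology and the counts are a hypothetical toy example, not real GO data, and the Wang score is not covered.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "root": set(),
    "catalytic": {"root"},
    "binding": {"root"},
    "transferase": {"catalytic"},
    "hydrolase": {"catalytic"},
    "methyltransferase": {"transferase"},
    "formyltransferase": {"transferase"},
}

# Hypothetical numbers of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0, "catalytic": 10, "binding": 40, "transferase": 10,
    "hydrolase": 20, "methyltransferase": 10, "formyltransferase": 10,
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent edges."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / n(t)), where n(t) counts t and all its descendants."""
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, n in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += n
    total = cumulative["root"]
    return {t: math.log(total / c) for t, c in cumulative.items() if c > 0}


IC = information_content()


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(IC[t] for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Resnik score normalized by the IC of the two terms (range 0 to 1)."""
    denom = IC[a] + IC[b]
    return 2 * resnik(a, b) / denom if denom > 0 else 0.0
```

In this toy example two sibling transferase terms score higher than a transferase compared with a hydrolase, and terms that share only the root score zero, which mirrors how specificity rather than depth drives the measure.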

Annotating Domains
One of the challenges of functional annotation is the granularity at which an annotation can be attached. Most genomic annotations are assigned to whole protein translations, i.e. the gene, but for many functions it is a protein domain that can be considered the functional unit. Of course, functions are not solely confined to a single domain, and many functions are a product of multiple domains in combination. Many domains are combined with others in increasingly complex combinations and arrangements (see Fig. 3). This biological complexity adds considerable complexity to functional annotation, where one function can be assigned to the complete gene product and other functional annotations to just one component domain or to a multi-domain combination. There are a number of domain and motif databases that provide functional annotations, many of which are mapped to GO via the InterPro [11] protein family database, which integrates predictive models from a range of different protein family databases. One of the main sequence-based domain protein family databases is PFam [12], whose goal is to create a collection of functionally annotated families that is as representative as possible of protein-sequence space. PFam curators provide functional annotations, but in recent releases these annotations have been outsourced to the community via Wikipedia, allowing anyone to freely edit and improve the content, with the original curator annotations maintained. By their very nature these annotations do not conform to a controlled vocabulary, but it is possible for PFam annotations to be mapped back to GO terms; this mapping is provided by the InterPro group and is available via the GO website.

Fig. 3 caption (partial): The graph is centered on the architecture containing just the single domain, with nodes (red boxes) radiating from this representing ever-increasing multi-domain architectures (shown to the right of the node). A key to the domains in these multi-domain architectures is shown on the left, identified by PFam codes (starting PF or PB) or CATH codes. Functions are associated with the whole gene product as well as with single domains within the multi-domain architecture. An interactive version of this graph can be found at http://www.funtree.info/templates/showArch.php?cathcode=00001.00010.00010.00010&cathmethod=&cathcluster=&type=AS
The CATH [13] resource, which uses protein structures to define domains both within known protein structures and within sequences for which there is no structural information, uses the GO terms associated with a sequence to define functionally coherent clusters (termed FunFams) within the superfamily division of the classification. The functional annotation provided is derived from the predominant GO term found within the FunFam. These terms, though, are assigned to the whole sequence and not to the domain, and therefore may not directly relate to the specific function in which the domain participates. In the SFLD [14], domains that are critical for function are determined (often being used to define the superfamily), thereby linking the functional annotation to a domain or combination of domains within a multi-domain architecture (see Chap. 9 [27]). SUPERFAMILY [15], a domain-centric resource that uses an alternative structure-based domain classification called SCOP, attempts to assign functional annotations specifically to a domain. Using the GO semantic structure and the protein's multi-domain architecture, domain-centric functional annotations are statistically inferred, based on the assumption that if a GO term is annotated to proteins that contain a shared domain then that term should also confer functional indicators for that domain.
The SUPERFAMILY developers have generated a reduced version of GO for annotating domains, which forms part of a structural domain functional ontology (SDFO) [16]. The approach of linking ontological terms to a domain can be generalized to other ontologies, most notably for phenotypic annotations. For example, SUPERFAMILY integrates the Mammalian Phenotype Ontology (MPO) [17] from Mouse Genome Informatics (MGI) and the Human Phenotype Ontology (HPO) from the Online Mendelian Inheritance in Man (OMIM) [18] resource.
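The idea of inferring a domain-level annotation from protein-level annotations can be sketched as an over-representation test. The Python example below uses a one-sided hypergeometric test to ask whether a term is annotated to proteins containing a given domain more often than expected by chance. This is a generic stand-in chosen for illustration and is not claimed to be the procedure used by SUPERFAMILY; it ignores propagation of terms through the ontology and multiple-testing correction, and all protein, domain and term identifiers are hypothetical.

```python
from math import comb

# Hypothetical toy data: protein -> (domains present, annotated terms).
PROTEINS = {
    "prot1": ({"dom_A"}, {"term_X", "term_Y"}),
    "prot2": ({"dom_A", "dom_B"}, {"term_X"}),
    "prot3": ({"dom_A", "dom_C"}, {"term_X"}),
    "prot4": ({"dom_A"}, {"term_X"}),
    "prot5": ({"dom_B"}, {"term_Y"}),
    "prot6": ({"dom_C"}, {"term_Y"}),
    "prot7": ({"dom_B", "dom_C"}, {"term_Y"}),
    "prot8": ({"dom_C"}, set()),
}


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def domain_term_pvalue(domain, term, proteins):
    """One-sided p-value for the term being enriched among proteins with the domain."""
    N = len(proteins)
    K = sum(1 for _, terms in proteins.values() if term in terms)
    n = sum(1 for doms, _ in proteins.values() if domain in doms)
    k = sum(1 for doms, terms in proteins.values()
            if domain in doms and term in terms)
    return hypergeom_sf(k, N, K, n)


for term in ("term_X", "term_Y"):
    print(term, domain_term_pvalue("dom_A", term, PROTEINS))
```

In the toy data every protein containing `dom_A` carries `term_X`, so that pairing receives a small p-value, whereas `term_Y` appears with `dom_A` only once and is not supported as a domain-level annotation.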

Pathways and Interactions
Individual components of a pathway or groups of interacting proteins are described by the molecular function set of GO terms, while the pathways and interactions these components participate in are captured in the biological process GO terms. These provide overall descriptions of a biological process, such as signal transduction, or more specific terms such as thiamine metabolism. GO does not try to represent the dynamics or dependencies that make up a signaling or metabolic pathway, though the GO Consortium has recognized the importance of contextualizing gene product annotations and has begun to add some directional information (see Chap. 17 [28]). To put the components into the context of a metabolic pathway, for example, specialist databases such as KEGG, BioCarta, MetaCyc, the Pathway Interaction Database [19] and Reactome [20] are required (see Table 1). These provide curated and computationally derived descriptions of overall topologies and interactions, often displayed as pathway diagrams and maps. Many of these data resources are able to map terms back to GO. IntAct [21], a molecular interaction database curated from the literature or populated by data depositors, scores and filters interaction evidence to generate a high-confidence subset of molecular interactions that are exported to GO.
Combinations of GO terms and pathway/interaction databases can be used in the analysis of proteomics data for functional annotation. This can be achieved either by using methods for GO enrichment analysis and subsequently linking the results to external pathway resources [22], or by dynamically constructing the pathway/interaction network from the gene list of interest to create a functionally organized GO/pathway term network [23]. Additionally, proteins participating in common biological processes or sharing molecular functions are predictive of interactions [24]. Many methods that combine semantic similarity and machine learning techniques have been developed to use GO to predict protein-protein interactions (PPIs) (see ref. 25 and references therein).
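The observation that shared processes and functions are predictive of interactions can be illustrated with a very simple score: the Jaccard overlap between the annotation sets of two proteins. The Python sketch below is a deliberately minimal stand-in for the semantic similarity and machine learning methods referred to above, not a reimplementation of any of them; the proteins and term labels are hypothetical.

```python
from itertools import combinations

# Hypothetical annotation sets (process and function labels per protein).
ANNOTATIONS = {
    "protA": {"bp:signal_transduction", "mf:kinase_activity"},
    "protB": {"bp:signal_transduction", "mf:kinase_activity", "mf:ATP_binding"},
    "protC": {"bp:thiamine_metabolism"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(annotations):
    """Score every protein pair, highest annotation overlap first."""
    scored = [
        (jaccard(annotations[p], annotations[q]), p, q)
        for p, q in combinations(sorted(annotations), 2)
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1], item[2]))


for score, p, q in rank_pairs(ANNOTATIONS):
    print(f"{p} - {q}: {score:.2f}")
```

A score of this kind treats all terms as equally informative; replacing the set overlap with an IC-based semantic similarity is the usual refinement, and the resulting scores would typically serve as input features to a classifier rather than as predictions in their own right.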

Conclusions
The Gene Ontology provides a rich set of ontological terms to describe many aspects of a protein's function. Many of these terms have equivalents in more specialist resources that, like the Gene Ontology, collate primary data derived from the literature. Often these resources include functional annotations that are not directly captured in GO, or allow annotations to be collated around a different functional unit, as in the case of protein domain-centered functional annotations. Other types of functional descriptor, such as the dependencies in metabolic pathways and protein-protein interactions, are not explicitly captured in GO (though this is currently being addressed through GO annotation extensions), but GO in combination with other resources can be used to provide and enhance the functional annotation of proteins.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated. The images or other third party material in this chapter are included in the work's Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work's Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.