Computational Methods for Annotation Transfers from Sequence
- 24k Downloads
Surveys of public sequence resources show that experimentally supported functional information is still completely missing for a considerable fraction of known proteins and is clearly incomplete for an even larger portion. Bioinformatics methods have long made use of very diverse data sources alone or in combination to predict protein function, with the understanding that different data types help elucidate complementary biological roles. This chapter focuses on methods accepting amino acid sequences as input and producing GO term assignments directly as outputs; the relevant biological and computational concepts are presented along with the advantages and limitations of individual approaches.
Key wordsProtein function prediction Homology-based annotation transfers Phylogenomics Multi-domain architecture De novo function prediction
For decades experimentalists have been painstakingly probing a range of functional aspects of individual proteins. This steady but slow acquisition of functional data is in stark contrast to the results of next-generation sequencing technologies, which can survey gene expression regulation, genomic organization, and variation on a large scale . Similarly, parallel efforts aim to map the networks of interactions between proteins, nucleic acids, and metabolites that regulate biological processes [2, 3, 4]. Nonetheless, comprehensive studies of protein function are hindered, because the combinations of gene products, biological roles, and cellular conditions are too numerous and because many experimental protocols cannot be applied to all proteins. Furthermore, the results need to be critically interpreted, integrated with existing knowledge, and translated into machine-readable formats—such as Gene Ontology (GO)  terms—for further analyses.
Given the lack of a general theory which can link protein sequences and environmental conditions directly to biological functions from physicochemical properties, current methods for protein function prediction implement knowledge-based heuristics that transfer functional information from already annotated proteins to unannotated ones. This chapter reviews sequence-based approaches to GO term prediction, which are the most popular, well understood, and easily accessible to a wide range of users. The focus is primarily on the underpinning concepts and assumptions, as well as on the known advantages and pitfalls, which are all applicable to other controlled vocabularies, such as those described in the Chap. 19  “KEGG, EC and other sources of functional data”. How well current function prediction methods perform and how prediction accuracy can be measured are topics extensively covered in the Chap. 8  “Evaluating GO annotations”, Chap. 9  “Evaluating functional annotations in enzymes”, and Chap. 10  “Community Assessment”.
2 Annotation Transfers from Homologous Proteins
The most common way to annotate uncharacterized proteins consists in finding homologues—that is, proteins sharing common ancestry—of known function, and inheriting the information available for them under the assumption that function is evolutionarily conserved. BLAST  or PSI-BLAST  are routinely used to search for homologous sequences, and tools that compare sequences against hidden Markov models (HMMs), or pairs of profiles or of HMMs can be useful to extend the coverage of the protein sequence universe thanks to the increased sensitivity for remote homologues. A detailed presentation of sequence comparison methods is beyond the scope of this chapter and is available elsewhere . In the simplest case, transfers can be made from the sequence with experimentally validated annotations and the lowest E-value—and this represents a useful baseline to benchmark the effectiveness of more advanced methods. This approach can produce erroneous results when key functional residues are mutated, or when the alignment doesn’t span the whole length of the proteins—possibly indicating changes in domain architecture . Iterative transfers of computationally generated functional assignments can lead to uncontrolled propagation of such errors; the average error rate of molecular function annotations is estimated to approach 0 % only in the manually curated UniProtKB/SwissProt database, while it is substantially higher in un-reviewed resources .
Several studies have consequently attempted to estimate sequence similarity thresholds that would generate predictions with a guaranteed level of accuracy, and have suggested that 80 % global sequence identity should be generally sufficient for safe annotation transfers [16, 17, 18, 19, 20]. However, this rule of thumb can either be too stringent or too lax, because biological sequences evolve at differing rates due to the need to maintain physiological function on the one hand, and to avoid deregulated gene expression, protein translation, folding, or physical interactions on the other . Ideally, these cutoff values should be specific to individual families or even functional categories, but usually the number of labelled examples is not sufficient to allow reliable calibration. To circumvent these issues, it is possible to trade annotation specificity for accuracy, because broad functional aspects—e.g., about ligand binding and enzymatic or transporter activities—diverge at lower rates than the fine details—such as the specific metal ions bound or the molecules and chemical groups that are recognized and processed.
GOtcha  was the first tool to make predictions representing the enrichment of the GO terms assigned to BLAST hits in the hierarchical context of GO. It first calculates weights for each GO term, taking into account the number of similar sequences annotated with it and the statistical significance of the observed similarities. The program then considers the semantic relationships among the terms to update the tallies and reflect increasing confidence in more general annotations. PFP  follows a similar approach, but targets more difficult annotation cases, too, by leveraging information from PSI-BLAST hits with unconventionally high E-values. Furthermore, the scoring scheme exploits data about the co-occurrence of GO term pairs in UniProtKB entries, which allows safer annotations to be produced. Other methods fall in this category too, and interested readers are referred to the primary literature [24, 25, 26, 27]. More sophisticated approaches rely on machine learning  rather than statistical analyses, and use experimental data to train classifiers that predict GO terms based on an array of alignment-derived features—such as sequence similarity scores, E-values, the coverage of the sequences, or the scores that GOtcha calculates for each GO category [29, 30, 31].
3 Annotation Transfers from Orthologous Proteins
Simple homology-based predictors are quick but error prone because they don’t try to distinguish functionally equivalent relatives from those that have functionally diverged. In phylogenetic terms, this problem can be cast as classifying orthologues—homologue pairs evolved after speciation—and paralogues—homologue pairs derived from gene duplication. It is widely accepted that duplicated genes lack selective pressure to maintain their original biological roles, so they can easily undergo nucleotide changes ultimately leading to functional divergence . The realization that genetic diversity arises from gene losses and horizontal transfers, too, makes phylogenetic reconstruction even more complex.
In this setup, annotations can be transferred with varying levels of confidence depending on how many orthologues there are and how closely related they are. This can partly account for the observation that orthologues can diverge functionally, particularly over long evolutionary distances or after duplication events in at least one of the lineages . However, experimental studies have also shown that paralogues can retain functional equivalence, even long after the duplication event [34, 35]. Recent studies have consequently tested how useful the distinction between orthologues and paralogues is for protein function prediction and have drawn different conclusions [36, 37, 38, 39]. The latest findings suggest that the functional similarity between orthologues is slightly higher than that between paralogues at the same level of sequence divergence, and that the signal is stronger for cellular components than for biological processes or molecular functions .
The traditional approach to orthologue detection involves computationally intensive calculations to build phylogenetic trees and then identify gene duplication and loss events . SIFTER  builds on this framework to transfer the most specific experimentally supported molecular function terms available from the annotated sequences to all nodes in the tree using a Bayesian approach. The propagation algorithm captures the notion that functional transitions are more likely to occur after duplication than after speciation events, and when the terms are similar—i.e., the corresponding nodes are close in the GO graph. In order to speed up the computation, the authors have recently suggested limiting the number of GO term annotations that can be assigned to each protein , and they are providing pre-calculated predictions for a vast set of sequences from different species, including multi-domain proteins . The semiautomated Phylogenetic Annotation and Inference Tool (PAINT)  recently adopted by the GO consortium provides a more flexible framework, which tries to keep functional change events uncoupled, so that the gain of one function does not imply the loss of another and vice versa—a desirable feature for annotating biological processes and for dealing with multifunctional proteins in general. Furthermore, unlike SIFTER, PAINT makes no assumption about how function diverges over evolutionary distance and whether its conservation is higher within orthologous groups than between them.
The increasing availability of completely sequenced genomes has promoted the development of alternative algorithms for orthologue detection. These first categorize pairs of orthologues in any two species, and then cluster the results across organisms, which helps recognize and fix spurious assignments . The results are usually made publicly available in the form of specialized databases such as EggNOG , Ensembl Compara , Inparanoid , PANTHER , PhylomeDB , and OMA , and the clustering results provide the basis for GO term annotation transfers, under the assumption that the members of an orthologous group are functionally equivalent.
4 Annotation Transfers from Protein Families
Even when the sequence similarities between proteins of interest and those that have previously been characterized are limited to specific sites, such as individual domains or motifs, they can still be useful for function prediction. Some biological activities such as molecular recognition, protein targeting, and pathway regulation have long been mechanistically linked to short linear motifs—stretches of 10–20 consecutive amino acids exposed on protein surfaces . Furthermore, some well-known protein families can be described by specific arrangements of multiple, possibly discontinuous, linear motifs, or by more general models of their domain sequences, namely sequence profiles  or hidden Markov models . Many public databases now give access to groups of evolutionarily related proteins, coding for individual domains or multi-domain architectures. Even though these resources cannot directly assign GO terms to the input amino acid sequences, they can produce valuable assignments to know protein families.
InterPro  collates such results from 11 specialized and complementary resources, which differ by the types of patterns used for family assignment, by the amount of manual curation of their contents, and by the use of additional data such as 3-D structure or phylogenetic trees. InterPro entries combine available data and organize them in a hierarchical way, which mirrors the biological relationships between families and subfamilies of proteins. The curators also enrich these annotations with supporting biological information from the scientific literature and with links to external resources such as the PDB  and GO. InterPro provides function predictions for the input sequences based on the InterPro2GO mapping, which links each protein domain family to the most specific GO terms that apply to all its members . These annotations form a large bulk of the electronically inferred functional assignments in UniProtKB, where they are integrated with associations generated from other controlled vocabularies, e.g., about subcellular localization and enzymatic activity.
CATH-Gene3D  and SUPERFAMILY  are two databases that store domain assignments for known protein sequences based on the CATH  and SCOP  protein structure classification schemes, respectively. CATH-Gene3D data are clustered into functional families which include relatives with highly similar sequences, structures, and functions, as to highlight the strong conservation of important regions such as specificity-determining residues. GO terms are associated probabilistically to each functional family based on how often they occur in the UniProtKB annotations of the whole sequences. The recent CATH FunFHMMer web server automates the search procedure for input sequences, resolves multi-domain architectures, assigns each predicted domain to its functional family, and finally inherits the GO term annotations found in the library . The dcGO—short for domain centric—method follows a similar route, but with some key differences . HMM models are built for both individual domains and supra-domains, i.e., sets of consecutive domains that are defined according to the SCOP structural definition and the evolutionary one in Pfam . Given the annotations in the GOA database  and the GO hierarchical structure, each domain and supra-domain is labelled with a set of GO terms that are associated with it in a statistically significant way. The strength of each association is then empirically converted into a confidence score. To facilitate the analysis of the results by non-specialists, the predicted GO terms are divided into four classes according to how specific and informative they are using their information content.
5 De Novo Function Annotation Using Biological Features
The function annotation methods described so far make use of homology to transfer GO terms to a target protein from other previously characterized proteins. In some cases, however, no useful functional annotations can be found for any of the detectable homologues, or in the most extreme case no homologous sequences can be found at all. In this case a de novo method is required which can infer GO terms directly from amino acid sequence in the absence of evolutionary relatedness. This is a very hard problem, and only a few tools have been developed which can handle these situations. The most successful approaches to date employ the basic idea of first transforming the target sequence into a set of component features. These features are then related to particular broad functional classes by means of supervised machine learning techniques. In this way the methods address the question of what kinds of functions can proteins perform with the given set of protein features. As a trivial example, proteins which are predicted to have particular numbers of transmembrane helices as component features will be more likely to have transmembrane transporter activity.
ProtFun, which makes use of neural networks, was the first widely used method for transferring functional annotations between human proteins through similarity of biochemical attributes, such as the occurrence of charged amino acids, low-complexity regions, signal peptides, trans-membrane helices, and posttranslationally modified residues [65, 66]. In the original ProtFun method, only the broad functional classes originally compiled by Monica Riley  were considered, but later the authors extended their approach to predicting a representative set of GO terms. FFPred, which is based on support vector machines, has taken this approach further by considering the observed strong correlation between the lengths and positions of intrinsically disordered protein regions with certain molecular functions and biological processes [68, 69]. As with ProtFun, FFPred was initially developed specifically for annotating human proteins, but the results have been shown to extend reasonably well to other vertebrate proteomes too.
Feature-based protein function assignment offers both advantages and disadvantages over sequence similarity-based approaches. The main advantage is fairly obvious: feature-based methods can work in the absence of homology to characterized proteins, and thus can even be used to assign GO terms to orphan proteins. A further advantage is that feature-based prediction is also able to provide insight into functional changes that occur after alternative splicing, as the input features are likely to reflect sequence deletions relative to the main transcript, e.g., the loss of a signal peptide or disordered region. Probably the main disadvantage is that classification models can only be built for GO terms where there are sufficient examples with experimentally validated assignments. This generally means that assignments can only be made for terms fairly high up in the overall GO graph, and thus highly specific predictions are generally not possible using this kind of approach. Of course, as datasets become larger, these methods will be able to overcome such limitation.
6 Conclusions and Outlook
The widening gap between the number of known sequences and those experimentally characterized has stimulated the development and refinement of a wide array of computational methods for protein function prediction. The scope of this survey has been limited to four classes of sequence-based approaches for GO term annotation transfers, but several other routes could be followed. If the 3-D structure of a protein has been solved or accurately modelled, it is possible to search for global or local structural similarities and predict binding regions and catalytic sites [70, 71]. Comparison of multiple complete genomes can help detect not only orthologous genes as described above, but also further patterns indicative of functional linkages between gene pairs such as fusion events, conserved chromosomal proximity, and co-occurrence/absence in a group of species . Phylogenetic profiling posits that coevolving protein families are functionally coupled, e.g., because they encode for proteins assembling into obligate complexes or participating in the same biological process. Since its inception, this “guilt-by-association” method has been implemented in several different ways , and tools able to make GO term assignments are also emerging . Involvement in the same biological process or co-localization can also be inferred from the analysis of protein-protein interaction maps, gene expression profiles, and phenotypic variations following engineered genetic mutations . Finally, integrative strategies combine all such heterogeneous data sources and hold the potential to produce more confident predictions, reduce errors, and overcome the intrinsic limitations of individual algorithms [31, 76, 77, 78]. For instance, protein sequence and structure data appear to be better suited to predict terms in the molecular function category, while genome-wide datasets can shed light on biological processes and protein subcellular localization. In the future, these methods will become increasingly valuable to generate testable hypotheses about protein function as they improve in accuracy – thanks to additional experimental data and to better ways of using them – as well as in user-friendliness to experimentalists and nonspecialists in general.
This work was partially supported by the UK Biotechnology and Biological Sciences Research Council. Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
- 3.Mahony S, Pugh BF (2015) Protein-DNA binding in high-resolution. Crit Rev Biochem Mol Biol:1–15. doi:10.3109/10409238.2015.1051505Google Scholar
- 5.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29. doi: 10.1038/75556 CrossRefPubMedPubMedCentralGoogle Scholar
- 7.Furnham N (2016) Complementary sources of protein functional information: the far side of GO. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 19Google Scholar
- 8.Škunca N, Roberts RJ, Steffen M (2016) Evaluating computational gene ontology annotations. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 8Google Scholar
- 9.Holliday GL, Davidson R, Akiva E, Babbitt PC (2016) Evaluating functional annotations of enzymes using the gene ontology. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 9Google Scholar
- 10.Friedberg I, Radivojac P (2016) Community-wide evaluation of computational function prediction. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 10Google Scholar
- 28.Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, New YorkGoogle Scholar
- 35.Tischler J, Lehner B, Chen N, Fraser AG (2006) Combinatorial RNA interference in Caenorhabditis elegans reveals that redundancy between gene duplicates can be maintained for more than 80 million years of evolution. Genome Biol 7(8):R69. doi: 10.1186/gb-2006-7-8-R69 CrossRefPubMedPubMedCentralGoogle Scholar
- 45.Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, Rattei T, Mende DR, Sunagawa S, Kuhn M, Jensen LJ, von Mering C, Bork P (2016) eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 44(D1):D286–D293. doi: 10.1093/nar/gkv1248 CrossRefPubMedGoogle Scholar
- 46.Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Giron CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kahari AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ruffier M, Sheppard D, Taylor K, Thormann A, Trevanion SJ, Vullo A, Wilder SP, Wilson M, Zadissa A, Aken BL, Birney E, Cunningham F, Harrow J, Herrero J, Hubbard TJ, Kinsella R, Muffato M, Parker A, Spudich G, Yates A, Zerbino DR, Searle SM (2014) Ensembl 2014. Nucleic Acids Res 42(Database issue):D749–D755. doi: 10.1093/nar/gkt1196 CrossRefPubMedGoogle Scholar
- 50.Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, Gonnet GH, Dessimoz C (2015) The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43(Database issue):D240–D249. doi: 10.1093/nar/gku1158 CrossRefPubMedGoogle Scholar
- 54.Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, Sangrador-Vegas A, Scheremetjew M, Rato C, Yong SY, Bateman A, Punta M, Attwood TK, Sigrist CJ, Redaschi N, Rivoire C, Xenarios I, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Mi H, Thomas PD, Finn RD (2015) The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 43(Database issue):D213–D221. doi: 10.1093/nar/gku1243 CrossRefPubMedGoogle Scholar
- 56.Burge S, Kelly E, Lonsdale D, Mutowo-Muellenet P, McAnulla C, Mitchell A, Sangrador-Vegas A, Yong SY, Mulder N, Hunter S (2012) Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database (Oxford) 2012:bar068. doi: 10.1093/database/bar068
- 57.Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees JG, Lehtinen S, Studer RA, Thornton J, Orengo CA (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43(Database issue):D376–D381. doi: 10.1093/nar/gku947 CrossRefPubMedGoogle Scholar
- 65.Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, Brunak S (2002) Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 319(5):1257–1265. doi: 10.1016/S0022-2836(02)00379-0 CrossRefPubMedGoogle Scholar
- 72.Galperin MY, Koonin EV (2014) Comparative genomics approaches to identifying functionally related genes. In: Algorithms for computational biology. Springer, Berlin, pp 1–24Google Scholar
This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.