Abstract
Protein function is a concept that can have different interpretations in different biological contexts, and the number and diversity of novel proteins identified by large-scale “omics” technologies pose ever-growing challenges. In this review we explore current strategies for predicting protein function from high-throughput sequence analysis, for example inference based on sequence similarity, sequence composition, structure, and protein–protein interaction. These prediction strategies are discussed together with illustrative workflows that highlight some benchmark tools and knowledge bases in the field.
Copyright information
© 2017 Springer Science+Business Media LLC
About this protocol
Cite this protocol
Cruz, L.M., Trefflich, S., Weiss, V.A., Castro, M.A.A. (2017). Protein Function Prediction. In: Kaufmann, M., Klinger, C., Savelsbergh, A. (eds) Functional Genomics. Methods in Molecular Biology, vol 1654. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7231-9_5
DOI: https://doi.org/10.1007/978-1-4939-7231-9_5
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7230-2
Online ISBN: 978-1-4939-7231-9
eBook Packages: Springer Protocols