The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment

Part of the Methods in Molecular Biology book series (MIMB, volume 1415)

Abstract

As biomolecular sequencing becomes the main technique in the life sciences, functional interpretation of sequences in terms of biomolecular mechanisms with in silico approaches is becoming increasingly important. Function prediction tools are most powerful for protein-coding sequences; yet the concepts and technologies used for this purpose are not well reflected in bioinformatics textbooks. Notably, protein sequences typically consist of globular domains and non-globular segments, and the two types of regions require fundamentally different approaches to function prediction. The former are classic targets for homology-inspired function transfer based on remnant, yet statistically significant, sequence similarity to other, characterized sequences. The latter are characterized by compositional bias or simple, repetitive patterns and require lexical analysis and/or empirical sequence pattern–function correlations. The recipe for function prediction recommends first finding all types of non-globular segments and then subjecting the remaining query sequence to sequence similarity searches. We provide an updated description of the ANNOTATOR software environment as an advanced example of a software platform that facilitates protein sequence-based function prediction.
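
The ordering in this recipe (flag non-globular segments first, search only the remainder second) can be illustrated with a minimal Python sketch. It is not the ANNOTATOR implementation. The sliding-window Shannon entropy filter, the function names (window_entropy, low_complexity_mask, split_for_search), and the default window size, entropy threshold, and minimum segment length are all illustrative assumptions. The filter also catches only compositional bias, which is one of several kinds of non-globular segment the recipe asks for.

```python
import math
from collections import Counter
from typing import List, Tuple


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_mask(seq: str, window: int = 12,
                        threshold: float = 2.2) -> List[bool]:
    """Flag every residue covered by at least one low-entropy window.

    The window size and threshold are assumed values for illustration only.
    """
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                flagged[i] = True
    return flagged


def split_for_search(seq: str, flagged: List[bool],
                     min_len: int = 30) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return the X-masked sequence and the unflagged segments.

    Segments are (start, end, subsequence) with end exclusive. Only runs of
    at least min_len residues (an assumed cutoff) are kept as candidates for
    a subsequent sequence similarity search.
    """
    masked = "".join("X" if f else aa for aa, f in zip(seq, flagged))
    segments = []
    start = None
    # A trailing True acts as a sentinel that closes the last open run.
    for i, f in enumerate(flagged + [True]):
        if not f and start is None:
            start = i
        elif f and start is not None:
            if i - start >= min_len:
                segments.append((start, i, seq[start:i]))
            start = None
    return masked, segments


if __name__ == "__main__":
    # Toy query: two compositionally diverse flanks around a poly-Q tract.
    flank = "ACDEFGHIKLMNPRSTVWY" * 2
    query = flank + "Q" * 20 + flank
    masked, candidates = split_for_search(query, low_complexity_mask(query))
    print(masked)
    for start, end, _ in candidates:
        print("search candidate:", start, end)
```

In a real workflow, the candidate segments would be passed to a similarity search tool of the reader's choice; that step is deliberately left out here because the command lines and parameters depend on the tool and database used.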

Key words

  • Protein sequence analysis
  • Protein function prediction
  • Globular domain
  • Non-globular segment
  • Genome annotation
  • ANNOTATOR

Corresponding author

Correspondence to Frank Eisenhaber.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Eisenhaber, B. et al. (2016). The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_25

  • DOI: https://doi.org/10.1007/978-1-4939-3572-7_25

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3570-3

  • Online ISBN: 978-1-4939-3572-7

  • eBook Packages: Springer Protocols