Integrated Tools for Biomolecular Sequence-Based Function Prediction as Exemplified by the ANNOTATOR Software Environment

  • Georg Schneider
  • Michael Wildpaner
  • Fernanda L. Sirota
  • Sebastian Maurer-Stroh
  • Birgit Eisenhaber
  • Frank Eisenhaber
Part of the Methods in Molecular Biology book series (MIMB, volume 609)


Given the amount of sequence data available today, in silico function prediction, which often includes detecting distant evolutionary relationships, requires sophisticated bioinformatic workflows. The algorithms behind these workflows involve complex data structures, need the ability to spawn subtasks, and tend to demand large amounts of resources. Performing sequence-analytic tasks by manually invoking individual function prediction algorithms, and converting between their differing input and output formats by hand, has become increasingly obsolete. After a period in which individual predictors were linked by ad hoc scripts, a number of integrated platforms are finally emerging. We present the ANNOTATOR software environment as an advanced example of such a platform.

Key words

sequence analysis, function prediction, visualization



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Georg Schneider
    • 1
  • Michael Wildpaner
    • 2
  • Fernanda L. Sirota
    • 1
  • Sebastian Maurer-Stroh
    • 1
  • Birgit Eisenhaber
    • 3
  • Frank Eisenhaber
    • 1
  1. Bioinformatics Institute (BII), Agency for Science, Technology, and Research (A*STAR), Singapore, Singapore
  2. Google Switzerland GmbH, Zürich, Switzerland
  3. Experimental Therapeutic Centre (ETC), Bioinformatics Institute (BII), Agency for Science, Technology, and Research (A*STAR), Singapore, Singapore
