Integrated Tools for Biomolecular Sequence-Based Function Prediction as Exemplified by the ANNOTATOR Software Environment

Schneider, Georg; Wildpaner, Michael; Sirota, Fernanda L.; Maurer-Stroh, Sebastian; Eisenhaber, Birgit; Eisenhaber, Frank

doi:10.1007/978-1-60327-241-4_15

Protocol
First Online: 30 October 2009

Part of the book series: Methods in Molecular Biology (MIMB, volume 609)

Abstract

Given the amount of sequence data available today, in silico function prediction, which often includes detecting distant evolutionary relationships, requires sophisticated bioinformatic workflows. The algorithms behind these workflows operate on complex data structures, need the ability to spawn subtasks, and tend to demand large amounts of resources. Performing sequence-analytic tasks by manually invoking individual function prediction algorithms, and converting between their differing input and output formats by hand, has become increasingly obsolete. After a period in which individual predictors were linked by ad hoc scripts, a number of integrated platforms are finally emerging. We present the ANNOTATOR software environment as an advanced example of such a platform.

Author information

Authors and Affiliations

Bioinformatics Institute (BII), Agency for Science, Technology, and Research (A*STAR), Singapore, Singapore
Georg Schneider, Fernanda L. Sirota, Sebastian Maurer-Stroh & Frank Eisenhaber
Google Switzerland GmbH, Zürich, Switzerland
Michael Wildpaner
Experimental Therapeutic Centre (ETC), Bioinformatics Institute (BII), Agency for Science, Technology, and Research (A*STAR), Singapore, Singapore
Birgit Eisenhaber

Editor information

Editors and Affiliations

Max F. Perutz Laboratories GmbH, Universität Wien, Dr. Bohr-Gasse 9, Wien, 1030, Austria
Oliviero Carugo
Research (A*STAR), Agency for Science & Technology, Biopolis Street 30, Singapore, 138671, Singapore
Frank Eisenhaber

About this protocol

Cite this protocol

Schneider, G., Wildpaner, M., Sirota, F.L., Maurer-Stroh, S., Eisenhaber, B., Eisenhaber, F. (2010). Integrated Tools for Biomolecular Sequence-Based Function Prediction as Exemplified by the ANNOTATOR Software Environment. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 609. Humana Press. https://doi.org/10.1007/978-1-60327-241-4_15

DOI: https://doi.org/10.1007/978-1-60327-241-4_15
Published: 30 October 2009
Publisher Name: Humana Press
Print ISBN: 978-1-60327-240-7
Online ISBN: 978-1-60327-241-4
eBook Packages: Springer Protocols
