Abstract
The genomes of an increasing number of species are being investigated through the generation of expressed sequence tags (ESTs). However, ESTs are prone to sequencing errors and typically define incomplete transcripts, making downstream annotation difficult. Robust polypeptide translations would greatly improve annotation. Many current solutions for EST translation require a large number of full-length gene sequences for training, a resource that is not available for the majority of EST projects. As part of our ongoing EST programs investigating these “neglected” genomes, we have developed a polypeptide prediction pipeline, prot4EST. It incorporates freely available software to produce final translations that are more accurate than those derived from any single method. We describe how this integrated approach goes a long way toward overcoming the deficit in training data.
Acknowledgments
We would like to thank our colleagues in Edinburgh and Toronto for support and user feedback, and the authors of the Decoder and ESTScan programs for permission to use their code. This work was carried out while JW was a BBSRC-funded PhD student.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Wasmuth, J., Blaxter, M. (2009). Obtaining Accurate Translations from Expressed Sequence Tags. In: Parkinson, J. (eds) Expressed Sequence Tags (ESTs). Methods in Molecular Biology, vol 533. Humana Press. https://doi.org/10.1007/978-1-60327-136-3_10
DOI: https://doi.org/10.1007/978-1-60327-136-3_10
Publisher Name: Humana Press
Print ISBN: 978-1-58829-759-4
Online ISBN: 978-1-60327-136-3
eBook Packages: Springer Protocols