Obtaining Accurate Translations from Expressed Sequence Tags

Wasmuth, James; Blaxter, Mark

doi:10.1007/978-1-60327-136-3_10

Part of the book series: Methods in Molecular Biology (MIMB, volume 533)

Abstract

The genomes of an increasing number of species are being investigated through the generation of expressed sequence tags (ESTs). However, ESTs are prone to sequencing errors and typically define incomplete transcripts, making downstream annotation difficult. Annotation would be greatly improved with robust polypeptide translations. Many current solutions for EST translation require a large number of full-length gene sequences for training purposes, a resource that is not available for the majority of EST projects. As part of our ongoing EST programs investigating these “neglected” genomes, we have developed a polypeptide prediction pipeline, prot4EST. It incorporates freely available software to produce final translations that are more accurate than those derived from any single method. We describe how this integrated approach goes a long way to overcoming the deficit in training data.

Acknowledgments

We would like to thank our colleagues in Edinburgh and Toronto for support and user feedback, and the authors of the Decoder and ESTScan programs for permission to use their code. This work was carried out while JW was a BBSRC-funded PhD student.

Author information

Authors and Affiliations

Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
James Wasmuth & Mark Blaxter
Program for Molecular Structure and Function, Hospital for Sick Children, Toronto, Ontario, Canada
James Wasmuth

Editor information

Editors and Affiliations

Molecular Structure and Function, Hospital for Sick Children, Departments of Biochemistry & Molecular Genetics, University of Toronto, Toronto, ON, Canada
John Parkinson

About this protocol

Cite this protocol

Wasmuth, J., Blaxter, M. (2009). Obtaining Accurate Translations from Expressed Sequence Tags. In: Parkinson, J. (ed.) Expressed Sequence Tags (ESTs). Methods in Molecular Biology, vol 533. Humana Press. https://doi.org/10.1007/978-1-60327-136-3_10

DOI: https://doi.org/10.1007/978-1-60327-136-3_10
Published: 10 March 2009
Publisher Name: Humana Press
Print ISBN: 978-1-58829-759-4
Online ISBN: 978-1-60327-136-3
eBook Packages: Springer Protocols
