Obtaining Accurate Translations from Expressed Sequence Tags

  • James Wasmuth
  • Mark Blaxter
Part of the Methods in Molecular Biology book series (MIMB, volume 533)


The genomes of an increasing number of species are being investigated through the generation of expressed sequence tags (ESTs). However, ESTs are prone to sequencing errors and typically define incomplete transcripts, making downstream annotation difficult. Annotation would be greatly improved with robust polypeptide translations. Many current solutions for EST translation require a large number of full-length gene sequences for training purposes, a resource that is not available for the majority of EST projects. As part of our ongoing EST programs investigating these “neglected” genomes, we have developed a polypeptide prediction pipeline, prot4EST. It incorporates freely available software to produce final translations that are more accurate than those derived from any single method. We describe how this integrated approach goes a long way to overcoming the deficit in training data.

Key words

Expressed sequence tags, ESTs, protein translations, simulated transcriptomes



We would like to thank our colleagues in Edinburgh and Toronto for support and user feedback, and the authors of the Decoder and ESTScan programs for permission to use their code. This work was carried out while JW was a BBSRC-funded PhD student.



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • James Wasmuth (1, 2)
  • Mark Blaxter (1)
  1. Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
  2. Program for Molecular Structure and Function, Hospital for Sick Children, Ontario, Canada
