Global Assembly of Expressed Sequence Tags

Part of the Methods in Molecular Biology book series (MIMB, volume 883)


The method described here for constructing Expressed Sequence Tag (EST) assemblies takes as input reads generated by 454 pyrosequencing, Sanger sequencing, and Illumina (Solexa) sequencing. It is consistent with, and parallels, many established EST assembly protocols, for example the TIGR Gene Indices. Input reads usually come from both internal and external sources: in addition to internally generated EST reads, expressed transcripts are collected from dbEST and from the NCBI GenBank nucleotide database (full-length and partial cDNAs). “Virtual” transcript sequences derived from whole-genome annotation projects can be excluded, depending on the needs of the project. Currently, in most cases, 454-derived sequences can be treated similarly to Sanger-derived ESTs. In contrast, the shorter Solexa-derived sequences must first undergo either de novo assembly or, if a reference genome is available, an “align-then-assemble” approach against that genome before the resulting transcripts can be used in a global EST assembly that combines Sanger and next-generation sequencing data.
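
The routing of inputs described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the chapter's pipeline: the names `ReadSet`, `route_read_set`, and the step labels are hypothetical, and preferring align-then-assemble whenever a reference genome exists is an assumption made here (the text allows either route for short reads).

```python
from dataclasses import dataclass

# Hypothetical step labels used only for this illustration.
DIRECT = "global assembly"
DE_NOVO = "de novo pre-assembly, then global assembly"
ALIGN_FIRST = "align-then-assemble, then global assembly"
EXCLUDED = "excluded"


@dataclass(frozen=True)
class ReadSet:
    name: str
    technology: str        # "sanger", "454", "illumina" or "solexa"
    virtual: bool = False  # True for transcripts predicted from genome annotation


def route_read_set(read_set, has_reference_genome, keep_virtual=False):
    """Return the (hypothetical) processing step for one input read set."""
    # "Virtual" transcripts may be excluded, depending on project needs.
    if read_set.virtual and not keep_virtual:
        return EXCLUDED
    tech = read_set.technology.lower()
    # 454-derived reads are treated like Sanger-derived ESTs.
    if tech in ("sanger", "454"):
        return DIRECT
    # Shorter Illumina (Solexa) reads need a pre-assembly round first.
    # Assumption: use the reference genome whenever one is available.
    if tech in ("illumina", "solexa"):
        return ALIGN_FIRST if has_reference_genome else DE_NOVO
    raise ValueError(f"unknown sequencing technology: {read_set.technology!r}")


if __name__ == "__main__":
    inputs = [
        ReadSet("in-house ESTs", "sanger"),
        ReadSet("454 run", "454"),
        ReadSet("short reads", "illumina"),
        ReadSet("annotation-derived transcripts", "sanger", virtual=True),
    ]
    for rs in inputs:
        print(rs.name, "->", route_read_set(rs, has_reference_genome=False))
```

Keeping the routing decision in one small function makes the project-specific choices (whether to keep virtual transcripts, whether a reference genome is used) explicit parameters rather than hidden defaults.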

Key words

Expressed Sequence Tags; Sequencing; EST assembly; 454 pyrosequencing; Sanger; Illumina; Solexa



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Center for Human Immunology, Autoimmunity, and Inflammation, National Institute of Health, Bethesda, USA
