Abstract
As a classic topic in bioinformatics, the fragment assembly problem has been studied for over two decades. Fragment assembly algorithms take a set of DNA fragments as input, piece them together into sets of aligned, overlapping fragments (i.e., contigs), and output a consensus sequence for each contig. The rapid advance of massively parallel sequencing, often referred to as next-generation sequencing (NGS), has revolutionized DNA sequencing by reducing both its time and cost by several orders of magnitude in the past few years, but it has also posed new challenges for fragment assembly. As a result, many new approaches have been developed to assemble NGS sequences, which are typically shorter and more error-prone than those produced by classic sequencing methods but are generated at much higher throughput. In this chapter, we review both classic and new algorithms for fragment assembly, with a focus on NGS sequences. We also discuss a few new assembly problems that emerge from the broader applications of NGS techniques and are distinct from the classic fragment assembly problem.
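To make the input and output of the fragment assembly problem concrete, the following is a minimal sketch of a greedy overlap-and-merge strategy. It is not taken from the chapter and is not any particular assembler's method. The function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and the sketch assumes error-free fragments from a single strand with exact overlaps, so the merged string stands in for the consensus sequence of each contig.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the longest exact overlap.

    Assumption: reads are error-free and all come from the same strand.
    Real assemblers must also handle sequencing errors, reverse
    complements, and repeats, none of which this sketch attempts.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining pair overlaps enough
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

Fragments that share no sufficiently long overlap are returned as separate contigs, which mirrors the fact that an assembly generally yields a set of contigs rather than one sequence.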
Copyright information
© 2012 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Lee, H., Tang, H. (2012). Next-Generation Sequencing Technologies and Fragment Assembly Algorithms. In: Anisimova, M. (eds) Evolutionary Genomics. Methods in Molecular Biology, vol 855. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-61779-582-4_5
DOI: https://doi.org/10.1007/978-1-61779-582-4_5
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-61779-581-7
Online ISBN: 978-1-61779-582-4
eBook Packages: Springer Protocols