Next-Generation Sequencing Technologies and Fragment Assembly Algorithms

  • Heewook Lee
  • Haixu TangEmail author
Part of the Methods in Molecular Biology book series (MIMB, volume 855)


As a classic topic in bioinformatics, the fragment assembly problem has been studied for over two decades. Fragment assembly algorithms take a set of DNA fragments as input, piece them together into a set of aligned overlapping fragments (i.e., contigs), and output a consensus sequence for each of the contigs. The rapid advance of massively parallel sequencing, often referred to as next-generation sequencing (NGS) technologies, has revolutionized DNA sequencing by reducing both its time and cost by several orders of magnitude in the past few years, but posed new challenges for fragment assembly. As a result, many new approaches have been developed to assemble NGS sequences, which are typically shorter with a higher error rate, but at a much higher throughput, than classic methods provided. In this chapter, we review both classic and new algorithms for fragment assembly, with a focus on NGS sequences. We also discuss a few new assembly problems emerging from the broader applications of NGS techniques, which are distinct from the classic fragment assembly problem.

Key words

Next-generation sequencing Fragment assembly algorithms Genome sequencing Overlap graph de Bruijn graph 


  1. 1.
    Sanger, F., Nicklen, S., and Coulson, A. (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 74, 5463.CrossRefPubMedGoogle Scholar
  2. 2.
    Wheeler, D., et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature, 452, 872–876.CrossRefPubMedGoogle Scholar
  3. 3.
    Bentley, D., et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.CrossRefPubMedGoogle Scholar
  4. 4.
    Wang, J., et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–65.CrossRefPubMedGoogle Scholar
  5. 5.
    Kim, J., et al. (2009) A highly annotated whole-genome sequence of a Korean individual. Nature, 460, 1011–1015.PubMedGoogle Scholar
  6. 6.
    Robertson, G., et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods, 4, 651–657.CrossRefPubMedGoogle Scholar
  7. 7.
    Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10, 57–63.CrossRefPubMedGoogle Scholar
  8. 8.
    Lister, R., et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315–322.Google Scholar
  9. 9.
    Ng, S., et al. (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461, 272–276.CrossRefPubMedGoogle Scholar
  10. 10.
    Ronaghi, M., Uhlen, M., and Nyren, P. (1998) A sequencing method based on real-time pyrophosphate. Science(Washington), 281, 363–365.Google Scholar
  11. 11.
    Brenner, S., et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature biotechnology, 18, 630–634.CrossRefPubMedGoogle Scholar
  12. 12.
    Huse, S., Huber, J., Morrison, H., Sogin, M., and Welch, D. (2007) Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology, 8, R143.CrossRefPubMedGoogle Scholar
  13. 13.
    Miller, J., Koren, S., and Sutton, G. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327.CrossRefPubMedGoogle Scholar
  14. 14.
    Li, H., Ruan, J., and Durbin, R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18, 1851.CrossRefPubMedGoogle Scholar
  15. 15.
    Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009) Ultra-fast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10, R25.CrossRefPubMedGoogle Scholar
  16. 16.
    Li, H. and Durbin, R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589.CrossRefPubMedGoogle Scholar
  17. 17.
    Alkan, C., et al. (2009) Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41, 1061–1067.CrossRefPubMedGoogle Scholar
  18. 18.
    Homer, N., Merriman, B., and Nelson, S. (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS One, 4, e7767.CrossRefPubMedGoogle Scholar
  19. 19.
    Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics, 24, 713.CrossRefPubMedGoogle Scholar
  20. 20.
    Demaine, E. and Demaine, M. (2007) Jigsaw puzzles, edge matching, and polyomino packing: Connections and complexity. Graphs and Combinatorics, 23, 195–208.CrossRefGoogle Scholar
  21. 21.
    Staden, R. (1979) A strategy of DNA sequencing employing computer programs. Nucleic Acids Research, 6, 2601.CrossRefPubMedGoogle Scholar
  22. 22.
    Lander, E. and Waterman, M. (1988) Genomic mapping by finger-printing random clones: a mathematical analysis. Genomics, 2, 231–239.CrossRefPubMedGoogle Scholar
  23. 23.
    Myers, E. (1995) Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2, 275–290.CrossRefPubMedGoogle Scholar
  24. 24.
    Green, P. (1994), PHRAP documentation.
  25. 25.
    Sutton, G., White, O., Adams, M., and Kerlavage, A. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology, 1, 9–19.Google Scholar
  26. 26.
    Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program. Genome research, 9, 868.CrossRefPubMedGoogle Scholar
  27. 27.
    Myers, E., et al. (2000) A whole-genome assembly of Drosophila. Science, 287, 2196.CrossRefPubMedGoogle Scholar
  28. 28.
    Idury, R. and Waterman, M. (1995) A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2, 291–306.CrossRefPubMedGoogle Scholar
  29. 29.
    Pevzner, P., Tang, H., and Waterman, M. (2001) An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98, 9748.CrossRefPubMedGoogle Scholar
  30. 30.
    Pop, M., Kosack, D., and Salzberg, S. (2004) Hierarchical scaffolding with Bambus. Genome Research, 14, 149.CrossRefPubMedGoogle Scholar
  31. 31.
    Yang, X., Dorman, K., and Aluru, S. (2010) Reptile: Representative Tiling for Short Read Error Correction. Bioinformatics, 26, 2526Google Scholar
  32. 32.
    Kelley, D., Schatz, M., and Salzberg, S. (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11, R116.CrossRefPubMedGoogle Scholar
  33. 33.
    Phillippy, A., Schatz, M., and Pop, M. (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9, R55.CrossRefPubMedGoogle Scholar
  34. 34.
    Choi, J., Kim, S., Tang, H., Andrews, J., Gilbert, D., and Colbourne, J. (2008) A machine-learning approach to combined evidence validation of genome assemblies. Bioinformatics, 24, 744.CrossRefPubMedGoogle Scholar
  35. 35.
    Gordon, D., Abajian, C., and Green, P. (1998) Consed: a graphical tool for sequence finishing. Genome Research, 8, 195.PubMedGoogle Scholar
  36. 36.
    Nielsen, C., Cantor, M., Dubchak, I., Gordon, D., and Wang, T. (2010) Visualizing genomes: techniques and challenges. Nature Methods, 7, S5–S15.CrossRefPubMedGoogle Scholar
  37. 37.
    Schatz, M., Phillippy, A., Shneiderman, B., and Salzberg, S. (2007) Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8, R34.CrossRefPubMedGoogle Scholar
  38. 38.
    Velasco, R., et al. (2007) A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One, 2, 1326.CrossRefGoogle Scholar
  39. 39.
    Goldberg, S., et al. (2006) A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proceedings of the National Academy of Sciences, 103, 11240.CrossRefGoogle Scholar
  40. 40.
    Huang, S., et al. (2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics, 41, 1275–1281.CrossRefPubMedGoogle Scholar
  41. 41.
    Reinhardt, J., Baltrus, D., Nishimura, M., Jeck, W., Jones, C., and Dangl, J. (2009) De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Research, 19, 294.CrossRefPubMedGoogle Scholar
  42. 42.
    Lee, S., Cheran, E., and Brudno, M. (2008) A robust framework for detecting structural variations in a genome. Bioinformatics, 24, i59.CrossRefPubMedGoogle Scholar
  43. 43.
    Hormozdiari, F., Alkan, C., Eichler, E., and Sahinalp, S. (2009) Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Research, 19, 1270.CrossRefPubMedGoogle Scholar
  44. 44.
    Lee, S., Hormozdiari, F., Alkan, C., and Brudno, M. (2009) MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods, 6, 473–474.CrossRefPubMedGoogle Scholar
  45. 45.
    Chen, K., et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6, 677–681.CrossRefPubMedGoogle Scholar
  46. 46.
    Ye, K., Schulz, M., Long, Q., Apweiler, R., and Ning, Z. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25, 2865.CrossRefPubMedGoogle Scholar
  47. 47.
    Pop, M., Phillippy, A., Delcher, A., and Salzberg, S. (2004) Comparative genome assembly. Briefings in Bioinformatics, 5, 237.CrossRefPubMedGoogle Scholar
  48. 48.
    Salzberg, S., Sommer, D., Puiu, D., and Lee, V. (2008) Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput Biol, 4, e1000186.CrossRefPubMedGoogle Scholar
  49. 49.
    Bansal, V. and Bafna, V. (2008) HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24, i153.CrossRefPubMedGoogle Scholar
  50. 50.
    Levy, S., et al. (2007) The diploid genome sequence of an individual human. PLoS Biol, 5, e254.CrossRefPubMedGoogle Scholar
  51. 51.
    Ye, Y. and Tang, H. (2009) An orfome assembly approach to metagenomics sequences analysis. Journal of Bioinformatics and Computational Biology, 7, 455.CrossRefPubMedGoogle Scholar
  52. 52.
    De Bona, F., Ossowski, S., Schneeberger, K., and Ratsch, G. (2008) Optimal spliced alignments of short sequence reads. BMC Bioinformatics, 9, O7.CrossRefGoogle Scholar
  53. 53.
    Trapnell, C., Pachter, L., and Salzberg, S. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105.CrossRefPubMedGoogle Scholar
  54. 54.
    Wang, K., et al. (2010) MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research, 38, e178.Google Scholar
  55. 55.
    Trapnell, C., Williams, B., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M., Salzberg, S., Wold, B., and Pachter, L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515.CrossRefPubMedGoogle Scholar
  56. 56.
    Warren, R., Sutton, G., Jones, S., and Holt, R. (2007) Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23, 500.CrossRefPubMedGoogle Scholar
  57. 57.
    Jeck, W., Reinhardt, J., Baltrus, D., Hickenbotham, M., Magrini, V., Mardis, E., Dangl, J., and Jones, C. (2007) Extending assembly of short DNA sequences to handle error. Bioinformatics, 23, 2942.CrossRefPubMedGoogle Scholar
  58. 58.
    Jeck, W., Reinhardt, J., Baltrus, D., Hickenbotham, M., Magrini, V., Mardis, E., Dangl, J., and Jones, C. (2007) Extending assembly of short DNA sequences to handle error. Bioinformatics, 23, 2942.CrossRefPubMedGoogle Scholar
  59. 59.
    Batzoglou, S., Jaffe, D., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J., and Lander, E. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Research, 12, 177.CrossRefPubMedGoogle Scholar
  60. 60.
    Jaffe, D., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J., Zody, M., and Lander, E. (2003) Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Research, 13, 91.CrossRefPubMedGoogle Scholar
  61. 61.
    Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A., Muller, W., Wetter, T., and Suhai, S. (2004) Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Research, 14, 1147.CrossRefPubMedGoogle Scholar
  62. 62.
    Life Sciences (2005), Newbler.Google Scholar
  63. 63.
    Chaisson, M. and Pevzner, P. (2008) Short read fragment assembly of bacterial genomes. Genome Research, 18, 324.CrossRefPubMedGoogle Scholar
  64. 64.
    Zerbino, D. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18, 821.CrossRefPubMedGoogle Scholar
  65. 65.
    Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I., Belmonte, M., Lander, E., Nusbaum, C., and Jaffe, D. (2008) ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research, 18, 810.CrossRefPubMedGoogle Scholar
  66. 66.
    Simpson, J., Wong, K., Jackman, S., Schein, J., Jones, S., and Birol, I. (2009) ABySS: A parallel assembler for short read sequence data. Genome Research, 19, 1117.CrossRefPubMedGoogle Scholar
  67. 67.
    Li, R., et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20, 265.CrossRefPubMedGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. 1.School of Informatics and ComputingIndiana UniversityBloomingtonUSA

Personalised recommendations