Skip to main content

The Contig Assembly Problem and Its Algorithmic Solutions

  • Chapter
  • First Online:
Algorithms for Next-Generation Sequencing Data

Abstract

DNA sequencing, assuming no prior knowledge on the target DNA fragment, may be roughly described as the succession of two steps. The first of them uses some sequencing technology to output, for a given DNA fragment (not necessarily a whole genome), a collection of possibly overlapping sequences (called reads) representing small parts of the initial DNA fragment. The second one aims at recovering the sequence of the entire DNA fragment by assembling the reads.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 159.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Ansorge, W.J.: Next-generation DNA sequencing techniques. New Biotechnol. 25(4), 195–203 (2009)

    Article  Google Scholar 

  2. Ariyaratne, P.N., Sung, W.K.: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2), 167–174 (2011)

    Article  Google Scholar 

  3. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)

    Article  MathSciNet  Google Scholar 

  4. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

    Article  MATH  Google Scholar 

  5. Bradnam, K.R., Fass, J.N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J.A., Chapuis, G., Chikhi, R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 1–31 (2013)

    Article  Google Scholar 

  6. Bresler, M., Sheehan, S., Chan, A.H., Song, Y.S.: Telescoper: de novo assembly of highly repetitive regions. Bioinformatics 28(18), i311–i317 (2012)

    Article  Google Scholar 

  7. Bryant, D.W., Wong, W.K., Mockler, T.C.: QSRA–a quality-value guided de novo short read assembler. BMC Bioinf. 10(1), 69 (2009)

    Article  Google Scholar 

  8. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., Jaffe, D.B.: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008)

    Article  Google Scholar 

  9. Carneiro, M.O., Russ, C., Ross, M.G., Gabriel, S.B., Nusbaum, C., DePristo, M.A.: Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13(1), 375 (2012)

    Article  Google Scholar 

  10. Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Res. 18(2), 324–330 (2008)

    Article  Google Scholar 

  11. Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19(2), 336–346 (2009)

    Article  Google Scholar 

  12. Chang, Y.J., Chen, C.C., Chen, C.L., Ho, J.M.: A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 13(Suppl. 7), S28 (2012)

    Google Scholar 

  13. Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31–37 (2014)

    Article  Google Scholar 

  14. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Proceedings of WABI 2012. LNBI, vol. 7534, pp. 236–248. Springer, Heidelberg (2012)

    Google Scholar 

  15. Dohm, J.C., Lottaz, C., Borodina, T., Himmelbauer, H.: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17(11), 1697–1706 (2007)

    Article  Google Scholar 

  16. Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K., Buffalo, V., Zerbino, D.R., Diekhans, M., et al.: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21(12), 2224–2241 (2011)

    Article  Google Scholar 

  17. El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), e1003,345+ (2013). doi:10.1371/journal.pcbi.1003345

    Google Scholar 

  18. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of FOCS’00, pp. 390–398 (2000)

    Google Scholar 

  19. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to NP-Completeness. Freeman, New York (1979)

    MATH  Google Scholar 

  20. Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108(4), 1513–1518 (2011)

    Article  Google Scholar 

  21. Henson, J., Tischler, G., Ning, Z.: Next-generation sequencing and large genome assemblies. Pharmacogenomics 13(8), 901–915 (2012)

    Article  Google Scholar 

  22. Hernandez, D., François, P., Farinelli, L., Østerås, M., Schrenzel, J.: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802–809 (2008)

    Article  Google Scholar 

  23. Hossain, M.S., Azimi, N., Skiena, S.: Crystallizing short-read assemblies around seeds. BMC Bioinform. 10(Suppl. 1), S16 (2009)

    Article  Google Scholar 

  24. Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comput. Biol. 2(2), 291–306 (1995)

    Article  Google Scholar 

  25. Jeck, W.R., Reinhardt, J.A., Baltrus, D.A., Hickenbotham, M.T., Magrini, V., Mardis, E.R., Dangl, J.L., Jones, C.D.: Extending assembly of short DNA sequences to handle error. Bioinformatics 23(21), 2942–2944 (2007)

    Article  Google Scholar 

  26. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  27. Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS One 8(9), e75505 (2013)

    Article  Google Scholar 

  28. Koren, S., Harhay, G.P., Smith, T., Bono, J.L., Harhay, D.M., Mcvey, S.D., Radune, D., Bergman, N.H., Phillippy, A.M.: Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14(9), R101 (2013)

    Article  Google Scholar 

  29. Landau, G.M., Vishkin, U.: Introducing efficient parallelism into approximate string matching and a new serial algorithm. In: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC ’86, pp. 220–230. ACM, New York (1986). doi:10.1145/12130.12152. http://doi.acm.org/10.1145/12130.12152

  30. Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)

    Article  Google Scholar 

  31. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2), 265–272 (2010)

    Article  Google Scholar 

  32. Lin, Y., Li, J., Shen, H., Zhang, L., Papasian, C.J., et al.: Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15), 2031–2037 (2011)

    Article  Google Scholar 

  33. Liu, Y., Schmidt, B., Maskell, D.: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinform. 12, 354 (2011)

    Article  Google Scholar 

  34. Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1–6 (2012)

    Article  Google Scholar 

  35. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pp. 319–327. Society for Industrial and Applied Mathematics, Philadelphia, PA (1990). http://dl.acm.org/citation.cfm?id=320176.320218

  36. Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Proceedings of WABI 2007. LNCS, vol. 4645, pp. 289–301. Springer, Heidelberg (2007)

    Google Scholar 

  37. Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A., Johnson, J., Li, K., Mobarry, C., Sutton, G.: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24(24), 2818–2824 (2008)

    Article  Google Scholar 

  38. Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(Suppl. 2), ii79–ii85 (2005)

    Google Scholar 

  39. Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287(5461), 2196–2204 (2000)

    Article  Google Scholar 

  40. Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16(7), 897–908 (2009)

    Article  MathSciNet  Google Scholar 

  41. Narzisi, G., Mishra, B.: Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons. Bioinformatics 27(2), 153–160 (2011)

    Article  Google Scholar 

  42. Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. USA 109(33), 13272–13277 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  43. Peltola, H., Söderlund, H., Ukkonen, E.: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 12(1), 307–321 (1984)

    Article  Google Scholar 

  44. Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: IDBA–a practical iterative de Bruijn graph de novo assembler. In: Research in Computational Molecular Biology, pp. 426–440. Springer, Heidelberg (2010)

    Google Scholar 

  45. Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: Idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)

    Article  Google Scholar 

  46. Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(Suppl. 1), S225–S233 (2001)

    Article  Google Scholar 

  47. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98(17), 9748–9753 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  48. Pop, M., Phillippy, A., Delcher, A.L., Salzberg, S.L.: Comparative genome assembly. Brief. Bioinform. 5(3), 237–248 (2004)

    Article  Google Scholar 

  49. Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)

    Article  Google Scholar 

  50. Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010)

    Article  Google Scholar 

  51. Schmidt, B., Sinha, R., Beresford-Smith, B., Puglisi, S.J.: A fast hybrid short read fragment assembly algorithm. Bioinformatics 25(17), 2279–2280 (2009)

    Article  Google Scholar 

  52. Sim, J.S., Park, K.: The consensus string problem for a metric is NP-complete. J. Discrete Algorithms 1(1), 111–117 (2003). doi:10.1016/S1570–8667(03)00011-X

    Google Scholar 

  53. Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)

    Article  Google Scholar 

  54. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)

    Article  Google Scholar 

  55. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)

    Article  Google Scholar 

  56. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)

    Article  Google Scholar 

  57. Warren, R.L., Sutton, G.G., Jones, S.J., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)

    Article  Google Scholar 

  58. Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Yu, D.W.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012)

    Article  Google Scholar 

  59. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)

    Article  Google Scholar 

  60. Zerbino, D.R., McEwen, G.K., Margulies, E.H., Birney, E.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de Novo assembler. PLoS One 4(12), e8407 (2009)

    Article  Google Scholar 

  61. Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J., Shen, B.: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One bf6(3), e17,915 (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Géraldine Jean .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Jean, G., Radulescu, A., Rusu, I. (2017). The Contig Assembly Problem and Its Algorithmic Solutions. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics