The Contig Assembly Problem and Its Algorithmic Solutions

Jean, Géraldine; Radulescu, Andreea; Rusu, Irena

doi:10.1007/978-3-319-59826-0_12

Géraldine Jean²,
Andreea Radulescu² &
Irena Rusu²

1872 Accesses
3 Altmetric

Abstract

DNA sequencing, assuming no prior knowledge on the target DNA fragment, may be roughly described as the succession of two steps. The first of them uses some sequencing technology to output, for a given DNA fragment (not necessarily a whole genome), a collection of possibly overlapping sequences (called reads) representing small parts of the initial DNA fragment. The second one aims at recovering the sequence of the entire DNA fragment by assembling the reads.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 159.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Ansorge, W.J.: Next-generation DNA sequencing techniques. New Biotechnol. 25(4), 195–203 (2009)
Article Google Scholar
Ariyaratne, P.N., Sung, W.K.: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2), 167–174 (2011)
Article Google Scholar
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
Article MathSciNet Google Scholar
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Article MATH Google Scholar
Bradnam, K.R., Fass, J.N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J.A., Chapuis, G., Chikhi, R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 1–31 (2013)
Article Google Scholar
Bresler, M., Sheehan, S., Chan, A.H., Song, Y.S.: Telescoper: de novo assembly of highly repetitive regions. Bioinformatics 28(18), i311–i317 (2012)
Article Google Scholar
Bryant, D.W., Wong, W.K., Mockler, T.C.: QSRA–a quality-value guided de novo short read assembler. BMC Bioinf. 10(1), 69 (2009)
Article Google Scholar
Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., Jaffe, D.B.: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008)
Article Google Scholar
Carneiro, M.O., Russ, C., Ross, M.G., Gabriel, S.B., Nusbaum, C., DePristo, M.A.: Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13(1), 375 (2012)
Article Google Scholar
Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Res. 18(2), 324–330 (2008)
Article Google Scholar
Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19(2), 336–346 (2009)
Article Google Scholar
Chang, Y.J., Chen, C.C., Chen, C.L., Ho, J.M.: A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 13(Suppl. 7), S28 (2012)
Google Scholar
Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31–37 (2014)
Article Google Scholar
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Proceedings of WABI 2012. LNBI, vol. 7534, pp. 236–248. Springer, Heidelberg (2012)
Google Scholar
Dohm, J.C., Lottaz, C., Borodina, T., Himmelbauer, H.: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17(11), 1697–1706 (2007)
Article Google Scholar
Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K., Buffalo, V., Zerbino, D.R., Diekhans, M., et al.: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21(12), 2224–2241 (2011)
Article Google Scholar
El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), e1003,345+ (2013). doi:10.1371/journal.pcbi.1003345
Google Scholar
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of FOCS’00, pp. 390–398 (2000)
Google Scholar
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to NP-Completeness. Freeman, New York (1979)
MATH Google Scholar
Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108(4), 1513–1518 (2011)
Article Google Scholar
Henson, J., Tischler, G., Ning, Z.: Next-generation sequencing and large genome assemblies. Pharmacogenomics 13(8), 901–915 (2012)
Article Google Scholar
Hernandez, D., François, P., Farinelli, L., Østerås, M., Schrenzel, J.: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802–809 (2008)
Article Google Scholar
Hossain, M.S., Azimi, N., Skiena, S.: Crystallizing short-read assemblies around seeds. BMC Bioinform. 10(Suppl. 1), S16 (2009)
Article Google Scholar
Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comput. Biol. 2(2), 291–306 (1995)
Article Google Scholar
Jeck, W.R., Reinhardt, J.A., Baltrus, D.A., Hickenbotham, M.T., Magrini, V., Mardis, E.R., Dangl, J.L., Jones, C.D.: Extending assembly of short DNA sequences to handle error. Bioinformatics 23(21), 2942–2944 (2007)
Article Google Scholar
Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1993)
Article MathSciNet MATH Google Scholar
Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS One 8(9), e75505 (2013)
Article Google Scholar
Koren, S., Harhay, G.P., Smith, T., Bono, J.L., Harhay, D.M., Mcvey, S.D., Radune, D., Bergman, N.H., Phillippy, A.M.: Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14(9), R101 (2013)
Article Google Scholar
Landau, G.M., Vishkin, U.: Introducing efficient parallelism into approximate string matching and a new serial algorithm. In: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC ’86, pp. 220–230. ACM, New York (1986). doi:10.1145/12130.12152. http://doi.acm.org/10.1145/12130.12152
Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)
Article Google Scholar
Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2), 265–272 (2010)
Article Google Scholar
Lin, Y., Li, J., Shen, H., Zhang, L., Papasian, C.J., et al.: Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15), 2031–2037 (2011)
Article Google Scholar
Liu, Y., Schmidt, B., Maskell, D.: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinform. 12, 354 (2011)
Article Google Scholar
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1–6 (2012)
Article Google Scholar
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pp. 319–327. Society for Industrial and Applied Mathematics, Philadelphia, PA (1990). http://dl.acm.org/citation.cfm?id=320176.320218
Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Proceedings of WABI 2007. LNCS, vol. 4645, pp. 289–301. Springer, Heidelberg (2007)
Google Scholar
Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A., Johnson, J., Li, K., Mobarry, C., Sutton, G.: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24(24), 2818–2824 (2008)
Article Google Scholar
Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(Suppl. 2), ii79–ii85 (2005)
Google Scholar
Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287(5461), 2196–2204 (2000)
Article Google Scholar
Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16(7), 897–908 (2009)
Article MathSciNet Google Scholar
Narzisi, G., Mishra, B.: Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons. Bioinformatics 27(2), 153–160 (2011)
Article Google Scholar
Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. USA 109(33), 13272–13277 (2012)
Article MathSciNet MATH Google Scholar
Peltola, H., Söderlund, H., Ukkonen, E.: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 12(1), 307–321 (1984)
Article Google Scholar
Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: IDBA–a practical iterative de Bruijn graph de novo assembler. In: Research in Computational Molecular Biology, pp. 426–440. Springer, Heidelberg (2010)
Google Scholar
Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: Idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
Article Google Scholar
Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(Suppl. 1), S225–S233 (2001)
Article Google Scholar
Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98(17), 9748–9753 (2001)
Article MathSciNet MATH Google Scholar
Pop, M., Phillippy, A., Delcher, A.L., Salzberg, S.L.: Comparative genome assembly. Brief. Bioinform. 5(3), 237–248 (2004)
Article Google Scholar
Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)
Article Google Scholar
Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010)
Article Google Scholar
Schmidt, B., Sinha, R., Beresford-Smith, B., Puglisi, S.J.: A fast hybrid short read fragment assembly algorithm. Bioinformatics 25(17), 2279–2280 (2009)
Article Google Scholar
Sim, J.S., Park, K.: The consensus string problem for a metric is NP-complete. J. Discrete Algorithms 1(1), 111–117 (2003). doi:10.1016/S1570–8667(03)00011-X
Google Scholar
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
Article Google Scholar
Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)
Article Google Scholar
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Article Google Scholar
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Article Google Scholar
Warren, R.L., Sutton, G.G., Jones, S.J., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)
Article Google Scholar
Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Yu, D.W.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012)
Article Google Scholar
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Article Google Scholar
Zerbino, D.R., McEwen, G.K., Margulies, E.H., Birney, E.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de Novo assembler. PLoS One 4(12), e8407 (2009)
Article Google Scholar
Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J., Shen, B.: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One bf6(3), e17,915 (2011)
Google Scholar

Download references

Author information

Authors and Affiliations

LS2N, University of Nantes, Nantes, France
Géraldine Jean, Andreea Radulescu & Irena Rusu

Authors

Géraldine Jean
View author publications
You can also search for this author in PubMed Google Scholar
Andreea Radulescu
View author publications
You can also search for this author in PubMed Google Scholar
Irena Rusu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Géraldine Jean .

Editor information

Editors and Affiliations

LaTICE, Tunis, Tunisia
Mourad Elloumi

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Jean, G., Radulescu, A., Rusu, I. (2017). The Contig Assembly Problem and Its Algorithmic Solutions. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-59826-0_12
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics