Safe and Complete Contig Assembly Via Omnitigs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

Abstract

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs – a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66 % to 82 % longer on average than the popular unitigs, and 29 % of dbSNP locations have more neighbors in omnitigs than in unitigs.

References

  1. 1.
    Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comp. Biol. 19(5), 455–477 (2012)CrossRefMathSciNetGoogle Scholar
  2. 2.
    Boetzer, M., et al.: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4), 578–579 (2011)CrossRefGoogle Scholar
  3. 3.
    Boetzer, M., Pirovano, W.: Toward almost closed genomes with gapfiller. Genome Biol. 13(6), 1–9 (2012)CrossRefGoogle Scholar
  4. 4.
    Bresler, G., et al.: Optimal assembly for high throughput shotgun sequencing. BMC Bioinform. 14(Suppl 5), S18 (2013)MathSciNetGoogle Scholar
  5. 5.
    Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Heidelberg (2014)CrossRefGoogle Scholar
  6. 6.
    Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 236–248. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  7. 7.
    Guénoche, A.: Can we recover a sequence, just knowing all its subsequences of given length? Comput. Appl. Biosci. 8(6), 569–574 (1992)Google Scholar
  8. 8.
    Haussler, D., et al.: Genome 10 K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered. 100(6), 659–674 (2008)Google Scholar
  9. 9.
    Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comp. Biol. 2(2), 291–306 (1995)CrossRefGoogle Scholar
  10. 10.
    Jackson, B.G.: Parallel methods for short read assembly. Ph.D. thesis, Iowa State University (2009)Google Scholar
  11. 11.
    Kapun, E., Tsarev, F.: De Bruijn superwalk with multiplicities problem is NP-hard. BMC Bioinform. 14(Suppl 5), S7 (2013)Google Scholar
  12. 12.
    Kapun, E., Tsarev, F.: On NP-hardness of the paired de Bruijn sound cycle problem. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 59–69. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  13. 13.
    Kececioglu, J.D., Myers, E.W.: Combinatiorial algorithms for DNA sequence assembly. Algorithmica 13(1/2), 7–51 (1995)CrossRefMATHMathSciNetGoogle Scholar
  14. 14.
    Kececioglu, J.D.: Exact and approximation algorithms for DNA sequence reconstruction. Ph.D. thesis, University of Arizona, Tucson, AZ, USA (1992)Google Scholar
  15. 15.
    Kingsford, C., et al.: Assembly complexity of prokaryotic genomes using short reads. BMC Bioinform. 11(1), 21 (2010)CrossRefGoogle Scholar
  16. 16.
    Lam, K., et al.: Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinform. 15(S–9), S4 (2014)CrossRefGoogle Scholar
  17. 17.
    Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)CrossRefGoogle Scholar
  18. 18.
    Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 18 (2012)CrossRefGoogle Scholar
  19. 19.
    Lysov, I., et al.: Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. a new method. Dokl Akad Nauk SSSR 303(6), 1508–1511 (1988)Google Scholar
  20. 20.
    Medvedev, P., Brudno, M.: Maximum likelihood genome assembly. J. Comp. Biol. 16(8), 1101–1116 (2009)CrossRefMathSciNetGoogle Scholar
  21. 21.
    Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 289–301. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  22. 22.
    Medvedev, P., et al.: Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J. Comp. Biol. 18(11), 1625–1634 (2011)CrossRefMathSciNetGoogle Scholar
  23. 23.
    Miller, J.R., et al.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)CrossRefGoogle Scholar
  24. 24.
    Motahari, A.S., et al.: Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory 59(10), 6273–6289 (2013)CrossRefMathSciNetGoogle Scholar
  25. 25.
    Myers, E.W.: The fragment assembly string graph. In: ECCB/JBI, p. 85 (2005)Google Scholar
  26. 26.
    Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comp. Biol. 16(7), 897–908 (2009)CrossRefMathSciNetGoogle Scholar
  27. 27.
    Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3), 157–167 (2013)CrossRefGoogle Scholar
  28. 28.
    Narzisi, G., Mishra, B., Schatz, M.C.: On algorithmic complexity of biomolecular sequence assembly problem. In: Dediu, A.-H., Martín-Vide, C., Truthe, B. (eds.) AlCoB 2014. LNCS, vol. 8542, pp. 183–195. Springer, Heidelberg (2014)Google Scholar
  29. 29.
    Peltola, H., et al.: Algorithms for some string matching problems arising in molecular genetics. In: IFIP Congress, 59–64 (1983)Google Scholar
  30. 30.
    Pevzner, P.A.: L-Tuple DNA sequencing: computer analysis. J. Biomol. Struct. Dyn. 7(1), 63–73 (1989)Google Scholar
  31. 31.
    Pevzner, P.A., et al.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)CrossRefMATHMathSciNetGoogle Scholar
  32. 32.
    Rubinov, A.R., Gelfand, M.S.: Reconstruction of a string from substring precedence data. J. Comp. Biol. 2(2), 371–381 (1995)CrossRefGoogle Scholar
  33. 33.
    Sahlin, K., et al.: BESST-efficient scaffolding of large fragmented assemblies. BMC Bioinform. 15(1), 281 (2014)CrossRefMathSciNetGoogle Scholar
  34. 34.
    Salmela, L., Sahlin, K., Mäkinen, V., Tomescu, A.I.: Gap filling as exact path length problem. In: Przytycka, T.M. (ed.) RECOMB 2015. LNCS, vol. 9029, pp. 281–292. Springer, Heidelberg (2015)Google Scholar
  35. 35.
    Salzberg, S.L., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012)CrossRefGoogle Scholar
  36. 36.
    Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)CrossRefGoogle Scholar
  37. 37.
    Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)CrossRefGoogle Scholar
  38. 38.
    Tomescu, A.I., Medvedev, P.: Safe and complete contig assembly via omnitigs (2016). http://arxiv.org/abs/1601.02932
  39. 39.
    Uricaru, R., et al.: Reference-free detection of isolated SNPs. Nucleic Acids Res. 43(2), e11 (2015)CrossRefGoogle Scholar
  40. 40.
    Vyahhi, N., Pyshkin, A., Pham, S., Pevzner, P.A.: From de Bruijn graphs to rectangle graphs for genome assembly. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 249–261. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  41. 41.
    Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)CrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1. Helsinki Institute for Information Technology HIIT,Department of Computer ScienceUniversity of HelsinkiHelsinkiFinland
  2. 2.The Pennsylvania State UniversityState CollegeUSA

Personalised recommendations