From Indexing Data Structures to de Bruijn Graphs

  • Bastien Cazaux
  • Thierry Lecroq
  • Eric Rivals
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8486)

Abstract

New technologies have tremendously increased sequencing throughput compared to traditional techniques, thereby complicating DNA assembly. Hence, assembly programs resort to de Bruijn graphs (dBG) of k-mers of short reads to compute a set of long contigs, each being a putative segment of the sequenced molecule. Other types of DNA sequence analysis, as well as preprocessing of the reads for assembly, use classical data structures to index all substrings of the reads. It is thus interesting to exhibit algorithms that directly build a dBG of order k from a pre-existing index, and especially a contracted version of the dBG, where non branching paths are condensed into single nodes. Here, we formalise the relationship between suffix trees/arrays and dBGs, and exhibit linear time algorithms for constructing the full or contracted dBGs. Finally, we provide hints explaining why this bridge between indexes and dBGs enables to dynamically update the order k of the graph.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Apostolico, A.: The myriad virtues of suffix trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO Advanced Science Institutes, Series F, vol. 12, pp. 85–96. Springer (1985)Google Scholar
  2. 2.
    Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19(5), 455–477 (2012)CrossRefMathSciNetGoogle Scholar
  3. 3.
    Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn Graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  4. 4.
    Cazaux, B., Lecroq, T., Rivals, E.: From Indexing Data Structures to de Bruijn Graphs. Technical report, lirmm-00950983 (February 2014)Google Scholar
  5. 5.
    Chikhi, R., Limasset, A., Jackman, S., Simpson, J., Medvedev, P.: On the representation of de Bruijn graphs. ArXiv e-prints (January 2014)Google Scholar
  6. 6.
    Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology 8, 22 (2013)CrossRefGoogle Scholar
  7. 7.
    Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011)CrossRefGoogle Scholar
  8. 8.
    de Bruijn, N.: On bases for the set of integers. Publ. Math. Debrecen 1, 232–242 (1950)MATHMathSciNetGoogle Scholar
  9. 9.
    Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)CrossRefMATHGoogle Scholar
  10. 10.
    Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)CrossRefMATHMathSciNetGoogle Scholar
  11. 11.
    Onodera, T., Sadakane, K., Shibuya, T.: Detecting superbubbles in assembly graphs. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 338–348. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  12. 12.
    Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J., Brown, C.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA 109(33), 13272–13277 (2012)CrossRefMATHMathSciNetGoogle Scholar
  13. 13.
    Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. In: Berger, B. (ed.) RECOMB 2010. LNCS, vol. 6044, pp. 426–440. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  14. 14.
    Pevzner, P., Tang, H., Waterman, M.: An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98(17), 9748–9753 (2001)CrossRefMATHMathSciNetGoogle Scholar
  15. 15.
    Rødland, E.A.: Compact representation of k-mer de Bruijn graphs for genome read assembly. BMC Bioinformatics 14, 313 (2013)CrossRefGoogle Scholar
  16. 16.
    Salmela, L.: Correction of sequencing errors in a mixed set of reads. Bioinformatics 26(10), 1284–1290 (2010)CrossRefGoogle Scholar
  17. 17.
    Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Bastien Cazaux
    • 1
  • Thierry Lecroq
    • 2
  • Eric Rivals
    • 1
  1. 1.L.I.R.M.M. & Institut Biologie ComputationnelleUniversité de Montpellier II, CNRS U.M.R. 5506MontpellierFrance
  2. 2.LITIS EA 4108, NormaStic CNRS FR 3638Université de RouenFrance

Personalised recommendations