Succinct de Bruijn Graphs

  • Alexander Bowe
  • Taku Onodera
  • Kunihiko Sadakane
  • Tetsuo Shibuya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7534)

Abstract

We propose a new succinct de Bruijn graph representation. If the de Bruijn graph of k-mers in a DNA sequence of length N has m edges, it can be represented in 4m + o(m) bits, which is much smaller than existing representations. The numbers of outgoing and incoming edges of a node are computed in constant time, and the outgoing and incoming edges with a given label are found in constant time and \(\mathcal{O}(k)\) time, respectively. The data structure is constructed in \(\mathcal{O}(Nk \log m/\log\log m)\) time using no additional space.
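To make the quantities in the abstract concrete, the following minimal Python sketch builds a plain, non-succinct de Bruijn graph and counts its edges m. It is not the representation proposed in the paper. It assumes the common convention that each distinct k-mer is one edge from its (k-1)-length prefix to its (k-1)-length suffix, which may differ from the paper's definition. The function names `de_bruijn_edges`, `build_adjacency`, and `leading_term_bits` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(seq, k):
    """Return the set of distinct k-mers in seq.

    Assumption: each distinct k-mer is treated as one edge of the graph.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_adjacency(edges):
    """Build naive dictionary-based adjacency maps (not succinct).

    out_edges[node][label] is the node reached by the outgoing edge
    whose label is the last character of the k-mer.
    in_edges[node][label] is the node the incoming edge comes from,
    labelled by the first character of the k-mer.
    """
    out_edges = defaultdict(dict)
    in_edges = defaultdict(dict)
    for kmer in edges:
        src, dst = kmer[:-1], kmer[1:]
        out_edges[src][kmer[-1]] = dst
        in_edges[dst][kmer[0]] = src
    return out_edges, in_edges


def leading_term_bits(m):
    """Leading term 4m of the 4m + o(m) bound stated in the abstract."""
    return 4 * m


if __name__ == "__main__":
    edges = de_bruijn_edges("ACGTACG", 3)
    out_edges, in_edges = build_adjacency(edges)
    m = len(edges)
    print("m =", m)
    print("outdegree of AC:", len(out_edges["AC"]))
    print("indegree of AC:", len(in_edges["AC"]))
    print("leading term of the space bound:", leading_term_bits(m), "bits")
```

For comparison, storing each of the m k-mers explicitly at 2 bits per base takes 2km bits, which grows with k, whereas the bound in the abstract is independent of k. The dictionary-based maps here spend far more space than either figure and serve only to show the degree and labelled-edge queries that the abstract refers to.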

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alexander Bowe (1)
  • Taku Onodera (2)
  • Kunihiko Sadakane (1)
  • Tetsuo Shibuya (2)

  1. National Institute of Informatics, Chiyoda-ku, Japan
  2. Human Genome Center, Institute of Medical Science, University of Tokyo, Minato-ku, Japan
