
Efficient Construction of a Compressed de Bruijn Graph for Pan-Genome Analysis

  • Timo Beller
  • Enno Ohlebusch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9133)

Abstract

Recently, Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph of maximal exact matches to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an \(O(n\log g)\) time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where \(n\) is the total length of the genomes and \(g\) is the length of the longest genome. In this paper, we present an algorithm that outperforms their algorithm in theory and in practice. More precisely, our algorithm has a better worst-case time complexity of \(O(n\log \sigma )\), where \(\sigma \) is the size of the alphabet (\(\sigma = 4\) for DNA). Moreover, experiments show that it is much faster than splitMEM while using only a fraction of the space required by splitMEM.
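
To make the object concrete, the following is a minimal Python sketch of a compressed de Bruijn graph built the naive way: collect all k-mers in hash sets, then merge non-branching chains of k-mers into single nodes. It is neither splitMEM nor the algorithm of this paper (no suffix tree or compressed index is involved), and it makes no attempt to reach the time or space bounds stated above. The function name `compressed_de_bruijn`, the parameter `k`, the toy inputs, and the choice to scan each genome separately instead of concatenating them with separator symbols are assumptions made for illustration only.

```python
def compressed_de_bruijn(genomes, k):
    """Naive sketch: return (node_labels, edges) of a compressed de Bruijn graph.

    Nodes of the uncompressed graph are the distinct k-mers; there is an edge
    u -> v whenever u and v occur at consecutive positions in some genome.
    """
    succ, pred = {}, {}
    for g in genomes:
        for i in range(len(g) - k + 1):
            kmer = g[i:i + k]
            succ.setdefault(kmer, set())
            pred.setdefault(kmer, set())
        for i in range(len(g) - k):
            u, v = g[i:i + k], g[i + 1:i + k + 1]
            succ[u].add(v)
            pred[v].add(u)

    def is_start(v):
        # v opens a chain unless its unique predecessor has v as unique successor.
        if len(pred[v]) != 1:
            return True
        (u,) = pred[v]
        return len(succ[u]) != 1

    def walk(first, seen):
        # Follow edges u -> v for as long as u has one successor and v one predecessor.
        chain = [first]
        seen.add(first)
        while len(succ[chain[-1]]) == 1:
            (nxt,) = succ[chain[-1]]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            chain.append(nxt)
            seen.add(nxt)
        return chain

    seen, chains = set(), []
    for kmer in sorted(succ):
        if is_start(kmer):
            chains.append(walk(kmer, seen))
    for kmer in sorted(succ):
        # Anything left lies on a cycle without branching; cut it at an arbitrary k-mer.
        if kmer not in seen:
            chains.append(walk(kmer, seen))

    def label(chain):
        # Consecutive k-mers overlap in k-1 characters, so append one character each.
        return chain[0] + "".join(x[-1] for x in chain[1:])

    head = {chain[0]: label(chain) for chain in chains}
    nodes = [label(chain) for chain in chains]
    edges = {(label(chain), head[v]) for chain in chains for v in succ[chain[-1]]}
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = compressed_de_bruijn(["ACGT", "ACGA"], 2)
    print(sorted(nodes))
    print(sorted(edges))
```

For the two toy genomes ACGT and ACGA with k = 2, the shared prefix collapses into the single node ACG, which branches to GT and GA. Hashing every k-mer costs time and space proportional to n·k in this sketch, which is the kind of overhead that index-based constructions such as those discussed in the abstract are designed to avoid.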

Keywords

Bloom Filter, Suffix Tree, Start Node, Adjacency List, Suffix Array
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

This work was supported by the DFG (OH 53/6-1).

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)
  2. Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Discrete Algorithms 18, 22–31 (2013)
  3. Beller, T., Zwerger, M., Gog, S., Ohlebusch, E.: Space-efficient construction of the Burrows-Wheeler transform. In: Kurland, O., Lewenstein, M., Porat, E. (eds.) SPIRE 2013. LNCS, vol. 8214, pp. 5–16. Springer, Heidelberg (2013)
  4. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Research Report 124, Digital Systems Research Center (1994)
  5. Cazaux, B., Lecroq, T., Rivals, E.: From indexing data structures to de Bruijn graphs. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 89–99. Springer, Heidelberg (2014)
  6. Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 697–710. Springer, Heidelberg (2010)
  7. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, pp. 390–398 (2000)
  8. Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theoret. Comput. Sci. 426–427, 25–41 (2012)
  9. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)
  10. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–850 (2003)
  11. Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)
  12. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, pp. 549–554 (1989)
  13. Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theoret. Comput. Sci. 387(3), 249–257 (2007)
  14. Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)
  15. Navarro, G., Ordóñez, A.: Faster compressed suffix trees for repetitive text collections. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 424–435. Springer, Heidelberg (2014)
  16. Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Germany (2013)
  17. Ohlebusch, E., Gog, S., Kügel, A.: Computing matching statistics and maximal exact matches on compressed full-text indexes. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 347–358. Springer, Heidelberg (2010)
  18. Okanohara, D., Sadakane, K.: A linear-time Burrows-Wheeler transform using induced sorting. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 90–101. Springer, Heidelberg (2009)
  19. Puglisi, S.J., Smyth, W.F., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39(2), 1–31 (2007). Article 4
  20. Rahn, R., Weese, D., Reinert, K.: Journaled string tree-a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24), 3499–3505 (2014)
  21. Rozowsky, J., Abyzov, A., Wang, J., Alves, P., Raha, D., Harmanci, A., Leng, J., Bjornson, R., Kong, Y., Kitabayashi, N., Bhardwaj, N., Rubin, M., Snyder, M., Gerstein, M.: AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522 (2011)
  22. Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., Weigel, D.: Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10(9), R98 (2009)
  23. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
  24. Välimäki, N., Rivals, E.: Scalable and versatile k-mer indexing for high-throughput sequencing data. In: Cai, Z., Eulenstein, O., Janies, D., Schwartz, D. (eds.) ISBRA 2013. LNCS, vol. 7875, pp. 237–248. Springer, Heidelberg (2013)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Institute of Theoretical Computer Science, University of Ulm, Ulm, Germany
