Using Cascading Bloom Filters to Improve the Memory Usage for de Bruijn Graphs
Abstract
De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing (NGS) data. Because NGS datasets are very large, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of Chikhi and Rizk (WABI, 2012), which represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory than theirs, with an insignificant impact on construction time. In our experiments, it also achieved better query times than their method. This is, to our knowledge, the best practical representation for de Bruijn graphs.
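As a rough illustration of the underlying idea, the sketch below stores the k-mers of a sequence in a single plain Bloom filter and answers neighbor queries by probing the four possible extensions. It is not the cascading construction proposed here, nor the exact structure of Chikhi and Rizk; the class `BloomFilter`, the helpers `kmers` and `successors`, the SHA-256 based double hashing, and all parameter values are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not the paper's structure)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def successors(bf, kmer):
    """Candidate out-neighbors of kmer: extensions that the filter reports as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


# Hypothetical toy input; real NGS data would contain billions of k-mers.
bf = BloomFilter(num_bits=1 << 16, num_hashes=4)
for km in kmers("ACGTACGGA", 4):
    bf.add(km)
print(successors(bf, "ACGT"))
```

Because a Bloom filter can report false positives, `successors` may occasionally return k-mers that are not in the data. An exact representation therefore needs some additional mechanism for handling those false positives, and the memory spent on that mechanism is the cost this work aims to reduce.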
Keywords
Hash Table, Memory Usage, Query Time, Bloom Filter, Construction Time
References
- 1. Blattner, F.R., Plunkett, G., Bloch, C.A., et al.: The complete genome sequence of Escherichia coli K-12. Science 277(5331), 1453–1462 (1997)
- 2. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012)
- 3. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 236–248. Springer, Heidelberg (2012)
- 4. Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011)
- 5. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., et al.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotech. 29(7), 644–652 (2011)
- 6. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44(2), 226–232 (2012)
- 7. Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: Building a better Bloom filter. Random Struct. Algorithms 33(2), 187–218 (2008)
- 8. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)
- 9. Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. U.S.A. 109(33), 13272–13277 (2012)
- 10. Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27(13), i94–i101 (2011)
- 11. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U.S.A. 98(17), 9748–9753 (2001)
- 12. Porat, E.: An optimal Bloom filter replacement based on matrix solving. In: Frid, A., Morozov, A., Rybalchenko, A., Wagner, K.W. (eds.) CSR 2009. LNCS, vol. 5675, pp. 263–273. Springer, Heidelberg (2009)
- 13. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics (2013)
- 14. Sacomoto, G., Kielbassa, J., Chikhi, R., Uricaru, R., et al.: KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics 13(suppl. 6), S5 (2012)
- 15. Ye, C., Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 13(suppl. 6), S1 (2012)