Improving Bloom Filter Performance on Sequence Data Using \(k\)-mer Bloom Filters

  • David Pellow
  • Darya Filippova
  • Carl Kingsford
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

Abstract

Using a sequence’s \(k\)-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since \(k\)-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for \(k\)-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of \(k\)-mers while allowing for fast set containment queries, at the cost of a low false positive rate. We show that, because \(k\)-mers are derived from sequencing reads, the information about \(k\)-mer overlap in the original sequence can be used to reduce the false positive rate by up to \(30\times\) with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage \(k\)-mer overlap information to store \(k\)-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such \(k\)-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/.
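
The overlap idea described above can be illustrated with a small Python sketch. It is not the reference implementation from the repository; it only assumes one plausible variant, in which a \(k\)-mer is reported as present when the Bloom filter contains it and also contains at least one \(k\)-mer overlapping it by \(k-1\) characters. The names `BloomFilter`, `kmers`, `neighbors`, and `one_sided_query`, the DNA alphabet `ACGT`, and the BLAKE2-based hashing are all illustrative assumptions. The sketch further assumes that every stored \(k\)-mer comes from a sequence longer than \(k\), so that it has at least one overlapping neighbor in the set.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not the kBF code)."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.h = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        for i in range(self.h):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, alphabet="ACGT"):
    """The k-mers that could precede or follow kmer in a sequence."""
    left = [c + kmer[:-1] for c in alphabet]
    right = [kmer[1:] + c for c in alphabet]
    return left + right


def one_sided_query(bf, kmer):
    """Report kmer as present only if it and some overlapping neighbor are in bf."""
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=3)
    for km in kmers("ACGTACGGTCA", 4):
        bf.add(km)
    print(one_sided_query(bf, "ACGT"))
```

In this sketch the extra neighbor lookups run only when the first membership test succeeds, so a query that the plain filter already rejects costs no more than before. A false positive now requires a spurious hit on the query and a hit on at least one of its eight candidate neighbors, which is the intuition behind the lower false positive rate; \(k\)-mers at the ends of reads and stricter two-sided checks would need additional handling that is not shown here.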

Keywords

Bloom filters, Efficient data structures, \(k\)-mers

Notes

Acknowledgments

The authors want to thank Dr. Geet Duggal and Hao Wang for the many helpful discussions. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford, by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  2. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA