Improving Bloom Filter Performance on Sequence Data Using \(k\)-mer Bloom Filters

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

Abstract

Using a sequence’s \(k\)-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since \(k\)-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for \(k\)-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of \(k\)-mers while allowing for fast set containment queries, at the cost of a small but nonzero false positive rate. We show that, because \(k\)-mers are derived from sequencing reads, the information about \(k\)-mer overlap in the original sequence can be used to reduce the false positive rate by up to \(30{\times }\) with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage \(k\)-mer overlap information to store \(k\)-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such \(k\)-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/.
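
The overlap idea described in the abstract can be illustrated with a minimal Python sketch. It is not the reference implementation linked above, and it may differ from the kBF variants in the paper. All names here (`BloomFilter`, `kmers`, `neighbors`, `contains_with_overlap`) are hypothetical, as is the SHA-256 double-hashing scheme. The sketch assumes that every read is at least \(k+1\) bases long, so that each stored \(k\)-mer has at least one neighbor from the same read that overlaps it by \(k-1\) bases. A query is accepted only when the filter reports both the \(k\)-mer and at least one of its eight possible neighbors.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over ASCII strings backed by a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive all positions from one SHA-256 digest by
        # double hashing; any family of independent hashes would also do.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def kmers(read, k):
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def neighbors(kmer):
    """The 8 k-mers that could overlap `kmer` by k-1 bases on either side."""
    left = [base + kmer[:-1] for base in "ACGT"]
    right = [kmer[1:] + base for base in "ACGT"]
    return left + right


def contains_with_overlap(bf, kmer):
    """Report kmer as present only if a potential neighbor is present too."""
    return kmer in bf and any(n in bf for n in neighbors(kmer))
```

To use the sketch, add every \(k\)-mer of every read to a `BloomFilter` and answer queries with `contains_with_overlap` instead of a plain membership test. A random false positive then has to coincide with a second positive among its neighbors, which is the intuition behind the lower false positive rate. The price is up to eight extra lookups per query. Under the stated assumption no stored \(k\)-mer is rejected, but a read of length exactly \(k\) would need separate handling.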

Keywords

Bloom filters, Efficient data structures, \(k\)-mers

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  2. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
