Improving Bloom Filter Performance on Sequence Data Using \(k\)-mer Bloom Filters

Conference paper

DOI: 10.1007/978-3-319-31957-5_10

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)
Cite this paper as:
Pellow D., Filippova D., Kingsford C. (2016) Improving Bloom Filter Performance on Sequence Data Using \(k\)-mer Bloom Filters. In: Singh M. (eds) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham

Abstract

Using a sequence’s \(k\)-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since \(k\)-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for \(k\)-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of \(k\)-mers while allowing for fast set containment queries, at the cost of a small false positive rate. We show that, because \(k\)-mers are derived from sequencing reads, the information about \(k\)-mer overlap in the original sequence can be used to reduce the false positive rate by up to \(30{\times }\) with little or no additional memory and with set containment queries that are only 1.3–1.6 times slower. Alternatively, we can leverage \(k\)-mer overlap information to store \(k\)-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such \(k\)-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/.
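
The overlap idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the reference implementation: the names `BloomFilter`, `kmers`, `neighbors`, and `contains_with_overlap`, the SHA-256 double-hashing scheme, and the rule "accept a \(k\)-mer only if at least one overlapping neighbor is also reported present" are assumptions made for illustration. The reasoning is that a \(k\)-mer drawn from a read almost always has an adjacent \(k\)-mer from the same read in the set, whereas a false positive is unlikely to also have a neighbor that tests positive.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not the kBF code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, alphabet="ACGT"):
    """The k-mers that could precede or follow kmer in a sequence."""
    left = [c + kmer[:-1] for c in alphabet]   # overlap on the first k-1 bases
    right = [kmer[1:] + c for c in alphabet]   # overlap on the last k-1 bases
    return left + right


def contains_with_overlap(bf, kmer):
    """Report kmer as present only if a neighbor is also reported present."""
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


# Usage: index the k-mers of one read, then query with the overlap check.
read = "ACGTACGGTCA"
bf = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers(read, 5):
    bf.add(km)
```

In this sketch each query costs up to nine Bloom filter lookups instead of one, so it trades query time for a lower false positive rate. A \(k\)-mer that entered the set without any neighbor (for example, from a read of length exactly \(k\)) would be rejected by this simple rule, which is one of the edge cases a complete design has to handle.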

Keywords

Bloom filters, Efficient data structures, \(k\)-mers

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  2. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
