Compact Universal k-mer Hitting Sets

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9838)

Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and by applying them to two bacterial genomes show that universal hitting k-mers improve on minimizers. The software and exemplary sets are freely available at acgt.cs.tau.ac.il/docks/.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of TechnologyCambridgeUSA
  2. 2.Blavatnik School of Computer ScienceTel-Aviv UniversityTel-avivIsrael
  3. 3.School of Computer ScienceCarnegie Mellon UniversityPittsburghUSA

Personalised recommendations