Compact Universal k-mer Hitting Sets

  • Yaron Orenstein
  • David Pellow
  • Guillaume Marçais
  • Ron ShamirEmail author
  • Carl KingsfordEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9838)


We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and by applying them to two bacterial genomes show that universal hitting k-mers improve on minimizers. The software and exemplary sets are freely available at


Conjugacy Class Directed Acyclic Graph Infinite Sequence Bloom Filter Greedy Heuristic 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



R.S. was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program 2015–2018. D.P. was supported in part by a Ph.D. fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. Part of this work was done while Y.O., R.S. and C.K. were visiting the Simons Institute for the Theory of Computing.


  1. 1.
    Grabowski, S., Raniszewski, M.: Sampling the suffix array with minimizers. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 287–298. Springer, Heidelberg (2015)CrossRefGoogle Scholar
  2. 2.
    Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004)CrossRefGoogle Scholar
  3. 3.
    Karkkainen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)CrossRefGoogle Scholar
  4. 4.
    Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016)CrossRefGoogle Scholar
  5. 5.
    Movahedi, N.S., Forouzmand, E., Chitsaz, H.: De novo co-assembly of bacterial genomes from multiple single cells. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1–5 (2012)Google Scholar
  6. 6.
    Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal \(k\)-mer counting. Bioinformatics 31(10), 1569–1576 (2015). Oxford Univ PressCrossRefGoogle Scholar
  7. 7.
    Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. J. Comput. Biol. 22, 336–352 (2015)MathSciNetCrossRefGoogle Scholar
  8. 8.
    Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. In: Proceedings of the VLDB Endowment, vol. 6, pp. 169–180. VLDB Endowment (2013)Google Scholar
  9. 9.
    Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13, S1 (2012)CrossRefGoogle Scholar
  10. 10.
    Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014)CrossRefGoogle Scholar
  11. 11.
    Sahinalp, S.C., Vishkin, U.: Efficient approximate and dynamic matching of patterns using a labeling paradigm. In: 37th Annual Symposium on Foundations of Computer Science, Proceedings, pp. 320–328 (1996)Google Scholar
  12. 12.
    Hach, F., Numanagi, I., Alkan, C., Sahinalp, S.C.: SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012)CrossRefGoogle Scholar
  13. 13.
    Mykkeltveit, J.: A proof of Golomb’s conjecture for the de Bruijn graph. J. Comb. Theory Ser. B 13, 40–45 (1972)MathSciNetCrossRefzbMATHGoogle Scholar
  14. 14.
  15. 15.
    Champarnaud, J.M., Hansel, G., Perrin, D.: Unavoidable sets of constant length. Int. J. Algebra Comput. 14, 241–251 (2004)MathSciNetCrossRefzbMATHGoogle Scholar
  16. 16.
    Chvatal, V.: A greedy heuristic for the set-covering problem. Math. Oper. Res. 4, 233–235 (1979)MathSciNetCrossRefzbMATHGoogle Scholar
  17. 17.
    Karp, R.M.: Reducibility among combinatorial problems. In: Jünger, M., Liebling, T.M., Naddef, D., Nemhauser, G.L., Pulleyblank, W.R., Reinelt, G., Rinaldi, G., Wolsey, L.A. (eds.) 50 Years of Integer Programming 1958–2008, pp. 219–241. Springer, Heidelberg (2010)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.Computer Science and Artificial Intelligence LaboratoryMassachusetts Institute of TechnologyCambridgeUSA
  2. 2.Blavatnik School of Computer ScienceTel-Aviv UniversityTel-avivIsrael
  3. 3.School of Computer ScienceCarnegie Mellon UniversityPittsburghUSA

Personalised recommendations