Compact Universal k-mer Hitting Sets

Orenstein, Yaron; Pellow, David; Marçais, Guillaume; Shamir, Ron; Kingsford, Carl

doi:10.1007/978-3-319-43681-4_21


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9838)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics


Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and, by applying the resulting sets to two bacterial genomes, show that universal hitting k-mers improve on minimizers. The software and example sets are freely available at acgt.cs.tau.ac.il/docks/.
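To make the notion of a universal hitting set concrete, here is a minimal Python sketch. It is not the DOCKS implementation: it skips phase (i) entirely and applies a plain greedy set-cover step to an explicit list of all L-long sequences, so it is only feasible for very small k and L. The function names (`kmers_of`, `is_universal`, `greedy_hitting_set`) and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from itertools import product


def kmers_of(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_universal(kmer_set, k, L, alphabet="ACGT"):
    """Check by brute force that every L-long sequence contains a k-mer from kmer_set."""
    kmer_set = set(kmer_set)
    return all(
        kmers_of("".join(s), k) & kmer_set
        for s in product(alphabet, repeat=L)
    )


def greedy_hitting_set(k, L, alphabet="ACGT"):
    """Naive greedy set cover over all L-long sequences (assumes L >= k).

    Illustrative only: enumerates |alphabet|**L sequences, so it does not scale.
    """
    seqs = ["".join(s) for s in product(alphabet, repeat=L)]
    # Map each k-mer to the indices of the sequences it hits.
    hits = {}
    for idx, s in enumerate(seqs):
        for m in kmers_of(s, k):
            hits.setdefault(m, set()).add(idx)
    unhit = set(range(len(seqs)))
    chosen = []
    while unhit:
        # Pick the k-mer hitting the most still-unhit sequences;
        # ties go to the alphabetically smallest k-mer.
        best = max(sorted(hits), key=lambda m: len(hits[m] & unhit))
        chosen.append(best)
        unhit -= hits[best]
    return chosen


if __name__ == "__main__":
    small = greedy_hitting_set(k=2, L=3, alphabet="AC")
    print(small, is_universal(small, 2, 3, "AC"))
```

On the two-letter alphabet with k = 2 and L = 3, any hitting set must contain AA and CC (to hit AAA and CCC) plus one of AC or CA, so three k-mers are necessary and sufficient; the greedy sketch finds such a set.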



Acknowledgments

R.S. was supported in part by the Israel Science Foundation as part of the ISF-NSFC joint program 2015–2018. D.P. was supported in part by a Ph.D. fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. Part of this work was done while Y.O., R.S. and C.K. were visiting the Simons Institute for the Theory of Computing.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
Yaron Orenstein
Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
David Pellow & Ron Shamir
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Guillaume Marçais & Carl Kingsford

Corresponding authors

Correspondence to Ron Shamir or Carl Kingsford.

Editor information

Editors and Affiliations

AIST and University of Tokyo, Tokyo, Japan
Martin Frith
Aarhus University, Aarhus, Denmark
Christian Nørgaard Storm Pedersen

About this paper

Cite this paper

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C. (2016). Compact Universal k-mer Hitting Sets. In: Frith, M., Storm Pedersen, C. (eds) Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science, vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_21


DOI: https://doi.org/10.1007/978-3-319-43681-4_21
Published: 06 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43680-7
Online ISBN: 978-3-319-43681-4
eBook Packages: Computer Science (R0)
