Fast Approximation of Frequent k-mers and Applications to Metagenomics

Part of the Lecture Notes in Computer Science book series (LNBI, volume 11467)

Abstract

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire dataset, which can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may suffice to report only the frequent k-mers, i.e., those that appear with relatively high frequency in a dataset. This is the case, for example, in the computation of k-mer abundance-based distances among datasets of reads, commonly used in metagenomic analyses.
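To make the quantities in the paragraph above concrete, the following minimal Python sketch computes exact k-mer frequencies, filters the frequent ones, and evaluates one abundance-based distance between two read sets. The function names, the threshold parameter `theta`, and the choice of the Bray-Curtis dissimilarity are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def kmer_frequencies(reads, k):
    """Exact frequency of each k-mer: its count divided by the total
    number of k-mer positions across all reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute no positions.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def frequent_kmers(freqs, theta):
    """Keep only k-mers whose frequency is at least theta
    (theta is a hypothetical user-chosen threshold)."""
    return {kmer: f for kmer, f in freqs.items() if f >= theta}


def bray_curtis(freqs_a, freqs_b):
    """Bray-Curtis dissimilarity between two k-mer frequency profiles.
    Chosen here as one common abundance-based distance (an assumption)."""
    keys = set(freqs_a) | set(freqs_b)
    shared = sum(min(freqs_a.get(x, 0.0), freqs_b.get(x, 0.0)) for x in keys)
    total = sum(freqs_a.values()) + sum(freqs_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0
```

Note that this exact computation visits every k-mer position in the input, which is the cost that a sampling-based method aims to avoid.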

In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how characterizing the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA can rigorously approximate frequent k-mers by processing only a fraction of a dataset, and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool for estimating k-mer abundances, providing significant speed-ups in the analysis of large sequencing datasets.
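The general idea of estimating k-mer frequencies from a sample can be sketched as follows. This is not SAKEIMA's sampling scheme, which the abstract describes only as "advanced"; it is plain uniform sampling with replacement of k-mer positions, and the sample size `m` is a hypothetical parameter chosen by the caller rather than derived from a VC-dimension bound.

```python
import random
from collections import Counter


def sampled_kmer_frequencies(reads, k, m, seed=0):
    """Estimate k-mer frequencies from m k-mer positions drawn uniformly
    at random with replacement. Illustrative only: m is not derived from
    any rigorous bound, and this is not the SAKEIMA sampling scheme."""
    rng = random.Random(seed)
    # Number of k-mer positions per read (0 for reads shorter than k).
    positions = [max(len(read) - k + 1, 0) for read in reads]
    if m <= 0 or sum(positions) == 0:
        return {}
    # Pick reads proportionally to their number of positions, then a
    # uniform offset inside each picked read; together this is a uniform
    # draw over all k-mer positions in the dataset.
    picked = rng.choices(range(len(reads)), weights=positions, k=m)
    counts = Counter()
    for r in picked:
        i = rng.randrange(positions[r])
        counts[reads[r][i:i + k]] += 1
    # The sample frequency is the estimate of the true frequency.
    return {kmer: c / m for kmer, c in counts.items()}
```

Only `m` k-mers are extracted and counted, regardless of the dataset size. How large `m` must be for the estimates to carry a rigorous guarantee is the question the paper's sample-size bounds address.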

Keywords

  • k-mer analysis
  • Sampling algorithm
  • VC dimension
  • Metagenomics

This work is supported, in part, by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining.

Notes

  1. Available at https://github.com/VandinLab/SAKEIMA.

  2. https://hmpdacc.org/HMASM.

  3. https://github.com/gmarcais/Jellyfish.

  4. Every instance of SAKEIMA and Jellyfish was executed with 1 worker, i.e., sequentially. Note that the Poisson approximation employed by SAKEIMA allows multiple workers to independently process the input k-mers; therefore, SAKEIMA can be used in a parallel scenario. We will investigate the impact of parallelism in the extended version of this work.

Author information

Corresponding author

Correspondence to Fabio Vandin.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Pellegrina, L., Pizzi, C., Vandin, F. (2019). Fast Approximation of Frequent k-mers and Applications to Metagenomics. In: Cowen, L. (ed.) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_13

  • DOI: https://doi.org/10.1007/978-3-030-17083-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer Science, Computer Science (R0)