International Symposium on String Processing and Information Retrieval

SPIRE 2015: String Processing and Information Retrieval pp 199-209

How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra

  • Michal Hozza
  • Tomáš Vinař
  • Broňa Brejová
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9309)


Many practical algorithms for sequence alignment, genome assembly and other tasks represent a sequence as a set of k-mers. Here, we address the problems of estimating genome size and sequencing coverage directly from sequencing reads, without the need for sequence assembly. Our estimates are based on a histogram of k-mer abundance in the input set of sequencing reads and on a probabilistic model of the k-mer abundance distribution, whose parameters relate to the coverage, the error rate and the repeat structure of the genome. Our method provides reliable estimates even at coverage as low as 0.5 or at error rates as high as 10%.
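To make the notion of a k-mer abundance histogram concrete, the following minimal Python sketch counts k-mers in a set of reads and tabulates how many distinct k-mers occur at each abundance. The function names (`kmer_counts`, `abundance_spectrum`, `naive_genome_size`) and the `min_abundance` cutoff are illustrative assumptions, not taken from the paper. The size estimate shown is the common peak-based heuristic (total k-mer occurrences divided by the abundance at the histogram peak), not the probabilistic model described in the abstract; it does not model sequencing errors or repeats.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_spectrum(counts):
    """Map abundance j to the number of distinct k-mers seen exactly j times."""
    return Counter(counts.values())


def naive_genome_size(spectrum, min_abundance=2):
    """Peak-based heuristic (an assumption of this sketch, not the paper's model).

    k-mers with abundance below min_abundance are discarded as likely errors.
    Returns (estimated number of genomic k-mers, abundance at the peak).
    """
    kept = {j: n for j, n in spectrum.items() if j >= min_abundance}
    if not kept:
        raise ValueError("no k-mers at or above min_abundance")
    # Peak = abundance shared by the most distinct k-mers; ties go to the lower abundance.
    peak = max(kept, key=lambda j: (kept[j], -j))
    total = sum(j * n for j, n in kept.items())
    return total / peak, peak


if __name__ == "__main__":
    # Toy input: four error-free copies of one short sequence.
    reads = ["ACGTTGCA"] * 4
    spectrum = abundance_spectrum(kmer_counts(reads, k=3))
    print(dict(spectrum))
    print(naive_genome_size(spectrum))
```

On the toy input, every one of the six distinct 3-mers appears four times, so the heuristic recovers six genomic k-mers at a k-mer coverage of four. On real data with errors, repeats or low coverage, the peak is much less well defined, which is the situation the modeling approach in the abstract is aimed at.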


Keywords: Genome Size; Sequencing Error; Genome Coverage; Giant Panda; Estimate Genome Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Mathematics, Physics, and Informatics, Comenius University, Bratislava, Slovakia
