How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra
Many practical algorithms for sequence alignment, genome assembly and other tasks represent a sequence as a set of k-mers. Here, we address the problems of estimating genome size and sequencing coverage directly from sequencing reads, without the need for sequence assembly. Our estimates are based on a histogram of k-mer abundance in the input set of sequencing reads and on probabilistic modeling of the distribution of k-mer abundance, with parameters related to the coverage, the error rate and the repeat structure of the genome. Our method provides reliable estimates even at coverage as low as 0.5 or at error rates as high as 10%.
Keywords: Genome Size, Sequencing Error, Genome Coverage, Giant Panda, Estimate Genome Size
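To make the k-mer abundance histogram mentioned in the abstract concrete, the following minimal Python sketch counts k-mers in a set of reads, builds the histogram, and derives a rough coverage and genome size estimate from its peak. This is a common peak-based heuristic, not the probabilistic model described in the abstract. The function names `kmer_spectrum` and `naive_estimates` and the `min_abundance` cutoff are hypothetical and chosen only for illustration. The sketch also assumes error-free, single-strand reads and a histogram with a clear peak, so it should not be expected to work in the low-coverage or high-error settings the abstract targets.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each abundance value to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read (strands are not merged here).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def naive_estimates(spectrum, min_abundance=2):
    """Peak-based heuristic (an illustrative assumption, not the paper's model).

    Returns (k-mer coverage, genome size in k-mers). Abundances below
    min_abundance are ignored as a crude stand-in for error filtering.
    """
    trusted = {a: n for a, n in spectrum.items() if a >= min_abundance}
    if not trusted:
        raise ValueError("no k-mers at or above min_abundance")
    # Take the most common abundance as the k-mer coverage.
    peak = max(trusted, key=trusted.get)
    # Total k-mer occurrences divided by coverage approximates genome size.
    total = sum(a * n for a, n in trusted.items())
    return peak, total / peak


if __name__ == "__main__":
    genome = "ACGTTGCA"      # toy sequence with 6 distinct 3-mers
    reads = [genome] * 4     # four error-free copies, so every 3-mer occurs 4 times
    spectrum = kmer_spectrum(reads, k=3)
    coverage, size = naive_estimates(spectrum)
    print(dict(spectrum), coverage, size)
```

With the toy input, every 3-mer has abundance 4, so the histogram has a single bar and the heuristic recovers the number of distinct k-mers. On real data the peak is blurred by errors and repeats, which is the motivation for fitting a model to the whole histogram instead.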