K-Mer-Based Genome Size Estimation in Theory and Practice

  • Protocol
Plant Cytogenetics and Cytogenomics

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2672))

Abstract

Recent advances in sequencing technologies have made it possible to sequence the genomes of non-model organisms, even those with very large and complex genomes. The resulting data can be used to estimate diverse genome characteristics, including genome size, repeat content, and levels of heterozygosity. K-mer analysis is a powerful biocomputational approach with a wide range of applications, including estimation of genome sizes. However, interpretation of the results is not always straightforward. Here, I review k-mer-based genome size estimation, focusing specifically on k-mer theory and peak calling in k-mer frequency histograms. I highlight common pitfalls in data analysis and result interpretation, and provide a comprehensive overview of current methods and programs developed to conduct these analyses.
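As a rough illustration of what peak calling in a k-mer frequency histogram involves, the sketch below applies the commonly used relation "genome size ≈ total number of k-mers / k-mer depth at the main peak". It is a minimal example under stated assumptions and is not taken from the chapter: the function name `estimate_genome_size` and the toy histogram are hypothetical, the low-multiplicity error k-mers are cut off at the first local minimum, and the highest remaining peak is assumed to be the homozygous (full-coverage) peak. That last assumption can fail for heterozygous or polyploid genomes, which is one of the interpretation pitfalls the abstract refers to.

```python
def estimate_genome_size(histogram):
    """Estimate genome size from a k-mer frequency histogram.

    histogram: iterable of (multiplicity, number_of_distinct_kmers) pairs.
    Returns (estimated_genome_size, peak_depth).
    Assumption: the highest peak after the error valley is the
    homozygous peak; this is not guaranteed for heterozygous genomes.
    """
    hist = sorted(histogram)
    counts = [count for _, count in hist]

    # Walk down the initial slope of low-multiplicity (likely erroneous)
    # k-mers until the counts stop decreasing: this is the error valley.
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1

    # Main peak: the multiplicity with the most distinct k-mers
    # at or after the valley.
    peak_idx = max(range(valley, len(counts)), key=lambda i: counts[i])
    peak_depth = hist[peak_idx][0]

    # Total k-mer occurrences, excluding the presumed error k-mers.
    total_kmers = sum(mult * count for mult, count in hist[valley:])

    return total_kmers / peak_depth, peak_depth


# Toy histogram, invented purely for illustration (not real data).
toy_histogram = [
    (1, 1000), (2, 100), (3, 10), (4, 50), (5, 200),
    (6, 500), (7, 200), (8, 50), (9, 10),
]
size, depth = estimate_genome_size(toy_histogram)
print(f"peak depth: {depth}, estimated genome size: {size:.0f} bp")
```

In practice the input histogram would come from a k-mer counter, and dedicated programs fit statistical models to the histogram rather than relying on a single peak position.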

Acknowledgments

All biocomputational analyses were conducted at the Centre for High Performance Computing (CHPC, Cape Town, South Africa). I would like to sincerely acknowledge Brian Bushnell for advice on peak calling of tetraploid species and Rei Kajitani for kindly providing the data for the inset of Fig. 5b.

Author information

Corresponding author

Correspondence to Uljana Hesse.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Hesse, U. (2023). K-Mer-Based Genome Size Estimation in Theory and Practice. In: Heitkam, T., Garcia, S. (eds) Plant Cytogenetics and Cytogenomics. Methods in Molecular Biology, vol 2672. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3226-0_4

  • DOI: https://doi.org/10.1007/978-1-0716-3226-0_4

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-3225-3

  • Online ISBN: 978-1-0716-3226-0

  • eBook Packages: Springer Protocols
