K-Mer-Based Genome Size Estimation in Theory and Practice

Hesse, Uljana

doi:10.1007/978-1-0716-3226-0_4

Part of the book series: Methods in Molecular Biology (MIMB, volume 2672)

Abstract

Recent advances in sequencing technologies have made it possible to sequence the genomes of non-model organisms with very large and complex genomes. The data can be used to estimate diverse genome characteristics, including genome size, repeat content, and levels of heterozygosity. K-mer analysis is a powerful biocomputational approach with a wide range of applications, including estimation of genome sizes. However, interpretation of the results is not always straightforward. Here, I review k-mer-based genome size estimation, focusing specifically on k-mer theory and peak calling in k-mer frequency histograms. I highlight common pitfalls in data analysis and result interpretation, and provide a comprehensive overview of current methods and programs developed to conduct these analyses.
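
To make the idea of peak calling in a k-mer frequency histogram concrete, the following is a minimal sketch of the widely used approximation "genome size ≈ total number of non-error k-mers / depth of the main peak". It is not the chapter's protocol and not the algorithm of any specific program; the function name `estimate_genome_size` and the toy histogram are hypothetical. It assumes a single well-separated main peak (a largely homozygous genome) and ignores heterozygosity, ploidy, and repeat modelling, which are among the pitfalls the abstract refers to.

```python
def estimate_genome_size(histogram):
    """Naive genome size estimate from a k-mer frequency histogram.

    histogram: list of (depth, count) pairs sorted by increasing depth,
    where count is the number of distinct k-mers observed `depth` times.
    Returns (peak_depth, estimated_genome_size).
    """
    depths = [d for d, _ in histogram]
    counts = [c for _, c in histogram]

    # Walk down the initial low-depth slope (mostly sequencing-error k-mers)
    # until the counts stop decreasing: this is the first valley.
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1

    # Main peak: the highest count at or after the valley.
    peak = max(range(valley, len(counts)), key=lambda i: counts[i])

    # Total k-mer occurrences, excluding the presumed error region.
    total_kmers = sum(d * c for d, c in zip(depths[valley:], counts[valley:]))

    return depths[peak], total_kmers / depths[peak]


# Toy, made-up histogram: an error slope at depths 1-3 and a peak at depth 10.
toy = [(1, 1000), (2, 200), (3, 50), (4, 10), (5, 20), (6, 60), (7, 120),
       (8, 200), (9, 280), (10, 300), (11, 280), (12, 200), (13, 120),
       (14, 60), (15, 20)]
peak_depth, size = estimate_genome_size(toy)
print(peak_depth, size)
```

In real data the histogram would come from a k-mer counter, and the choice of the error cutoff and of the peak (for example, heterozygous versus homozygous peak) can change the estimate substantially, so a sketch like this is best treated as a sanity check rather than a final result.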

Acknowledgments

All biocomputational analyses were conducted at the Centre for High Performance Computing (CHPC, Cape Town, South Africa). I sincerely thank Brian Bushnell for advice on peak calling in tetraploid species and Rei Kajitani for kindly providing the data for the inset of Fig. 5b.

Author information

Authors and Affiliations

Department of Biotechnology, University of the Western Cape, Bellville, South Africa
Uljana Hesse

Corresponding author

Correspondence to Uljana Hesse.

Editor information

Editors and Affiliations

Institute of Botany, TU Dresden, Dresden, Germany
Tony Heitkam
Botanical Institute of Barcelona, Barcelona, Spain
Sònia Garcia

About this protocol

Cite this protocol

Hesse, U. (2023). K-Mer-Based Genome Size Estimation in Theory and Practice. In: Heitkam, T., Garcia, S. (eds) Plant Cytogenetics and Cytogenomics. Methods in Molecular Biology, vol 2672. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3226-0_4

DOI: https://doi.org/10.1007/978-1-0716-3226-0_4
Published: 20 June 2023
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3225-3
Online ISBN: 978-1-0716-3226-0
eBook Packages: Springer Protocols
