Comprehensive genotyping of Brazilian Cassava (Manihot esculenta Crantz) Germplasm Bank: insights into diversification and domestication

Cassava (Manihot esculenta Crantz) is a major staple root crop of the tropics, originating from the Amazonas region. In this study, 3,354 cassava landraces and modern breeding lines from the Embrapa Cassava Germplasm Bank (CGB) were characterized. All individuals were subjected to genotyping-by-sequencing (GBS), identifying 27,045 single nucleotide polymorphisms (SNPs). Identity-by-state and population structure analyses revealed a unique set of 1,536 individuals and 10 distinct genetic groups with heterogeneous linkage disequilibrium (LD). On this basis, 1,300 to 4,700 SNP markers were selected for the detection of large-effect quantitative trait loci (QTL). The identified genetic groups were further characterized for population genetics parameters, including minor allele frequency (MAF), observed heterozygosity (Ho), effective population size estimate and polymorphism information content (PIC). Selection footprints and introgressions of M. glaziovii were detected. Spatial population structure analysis revealed five ancestral populations related to distinct Brazilian ecoregions. Estimation of historical relationships among the identified populations suggests an earliest population split from Amazonas to the Atlantic forest and Caatinga ecoregions, as well as active gene flow. This study provides a thorough genetic characterization of ex situ germplasm resources from the cassava center of origin, South America, with results shedding light on Brazilian cassava characteristics and its biogeographical landscape. These findings support and facilitate the use of genetic resources in modern breeding programs, including the implementation of association mapping and genomic selection strategies.

Key message: Brazilian cassava diversity was characterized through population genetics and clustering approaches, highlighting contrasting genetic groups and spatial genetic differentiation.

Linkage Disequilibrium (LD) defines, on average, the required number of SNP markers and mapping resolution [...] 22" W, 226 m altitude) were used for this study (Fig. 1a). The

The initial set of 3,345 germplasm accessions (annotated as the GA panel) was subjected to identity-by-state analysis (see details below), and a unique core set of 1,536 accessions was constructed (annotated as the GU panel [...] window of 500 markers. Imputation was performed using the genotype likelihood (GL) mode with 10 iteration steps. Imputed markers were filtered to retain those with an allelic correlation (r²) equal to or greater than 0.8. The dosage format was generated using the pseq library (http://atgu.mgh.harvard.edu/plinkseq/start-pseq.shtml).
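The identity-by-state step used to reduce the GA panel to a unique set can be illustrated with a minimal sketch. It assumes genotypes coded as allele dosages (0, 1, 2, with `None` for missing calls); the function names, the greedy selection and the 0.99 duplicate threshold are hypothetical choices for illustration, not the procedure or cut-off reported in the study.

```python
def ibs_similarity(a, b):
    """Mean identity-by-state between two dosage vectors (0/1/2, None = missing).

    Each marker contributes 1 - |a - b| / 2, so identical genotypes score 1,
    opposite homozygotes score 0, and a heterozygote vs. homozygote scores 0.5.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no markers genotyped in both individuals")
    return sum(1.0 - abs(x - y) / 2.0 for x, y in shared) / len(shared)


def unique_set(dosages, threshold=0.99):
    """Greedy de-duplication: keep an accession unless its IBS with an
    already-kept accession reaches the (hypothetical) duplicate threshold."""
    kept = []
    for name, geno in dosages.items():
        if all(ibs_similarity(geno, dosages[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (invented for illustration): acc_B duplicates acc_A.
dosages = {
    "acc_A": [0, 1, 2, 2, 0, 1],
    "acc_B": [0, 1, 2, 2, 0, 1],
    "acc_C": [2, 1, 0, 0, 2, None],
}
unique_accessions = unique_set(dosages)  # acc_A and acc_C are retained
```

A greedy pass depends on the order of the accessions; a clustering of the full pairwise IBS matrix would be an alternative design when the order should not matter.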

To limit the influence of group size on parameter estimates, all population genetic parameters were computed on 120 individuals randomly sampled from each identified group (1, 2, 5-10; Supplementary Table 2). Genetic groups (3, 4) with sample size n < 100 were not considered in downstream analyses.
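The equal-size subsampling and the per-marker parameters can be sketched as follows. This is an illustration under stated assumptions: the function names, the fixed random seed, the dosage coding (0, 1, 2, `None` for missing) and the handling of groups holding between 100 and 119 individuals (all members kept) are hypothetical. The PIC expression is the standard biallelic form, 1 - (p² + q²) - 2p²q².

```python
import random


def subsample_groups(groups, n_sample=120, min_size=100, seed=0):
    """Draw up to n_sample individuals per group, skipping groups below min_size."""
    rng = random.Random(seed)  # fixed seed (assumed) so the draw is reproducible
    sampled = {}
    for label, members in groups.items():
        if len(members) < min_size:
            continue  # small groups are excluded from downstream analysis
        k = min(n_sample, len(members))
        sampled[label] = rng.sample(members, k)
    return sampled


def marker_stats(dosages):
    """MAF, observed heterozygosity and PIC for one biallelic marker."""
    called = [d for d in dosages if d is not None]
    n = len(called)
    if n == 0:
        raise ValueError("marker has no genotype calls")
    p = sum(called) / (2.0 * n)  # frequency of the counted allele
    q = 1.0 - p
    maf = min(p, q)
    ho = sum(1 for d in called if d == 1) / n  # share of heterozygous calls
    pic = 1.0 - (p ** 2 + q ** 2) - 2.0 * p ** 2 * q ** 2
    return {"maf": maf, "ho": ho, "pic": pic}


# Toy groups (invented sizes): group 3 falls below the size threshold.
groups = {1: list(range(300)), 3: list(range(50)), 5: list(range(110))}
sampled = subsample_groups(groups)
stats = marker_stats([0, 1, 1, 2])
```

In this toy example, group 3 is dropped, group 1 is reduced to 120 individuals, and group 5 keeps all 110 members.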

To further understand the observed population structure, all individuals were mapped to their available metadata, providing insights into the observed clustering pattern (see Supplementary Table 6 [...] (Table 1). Estimates of effective population size range from 10.0 (NE-Admix-Group) to 87.2 (Bitter-Group), as seen in Table 1.

Interestingly, the interpolated ancestry resulting from the five stably formed groups (from K = 5 through [...] geographical boundaries, and restrictions to have influenced the observed population structure (Fig. 1a, Fig. 6a).

The distribution of germplasm and genetic diversity across Brazil is heterogeneous with more diverse

Genomic variation across the genome and within groups (excluding the family-structured group, Bahia-Group) reveals that, on average, based on the LD decay (r² < 0.1), 1,300 to 4,700 SNP markers would be needed to detect large-effect associations. Although 27,045 SNPs were used in this study, with an average distance of 19,085 bp between two SNPs, a much higher marker density will be required to detect small-effect associations (Supplementary Table 7).
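The link between LD decay and marker number can be made concrete with a small worked calculation. The sketch below assumes roughly one marker per LD-decay distance across the span covered by the SNPs; this is a common approximation and not necessarily the exact procedure behind Supplementary Table 7. The covered span is approximated from the SNP count and mean spacing reported above, and the two decay distances are hypothetical values chosen only to show the arithmetic.

```python
import math

N_SNPS = 27_045           # SNPs reported in this study
MEAN_SPACING_BP = 19_085  # reported average distance between two SNPs

# Approximate span covered by the markers (assumption: count x mean spacing).
approx_span_bp = N_SNPS * MEAN_SPACING_BP


def markers_needed(span_bp, ld_decay_bp):
    """Markers required if one marker tags each LD-decay-sized interval."""
    if ld_decay_bp <= 0:
        raise ValueError("LD decay distance must be positive")
    return math.ceil(span_bp / ld_decay_bp)


# Hypothetical decay distances (bp at which r^2 drops below 0.1).
for decay_bp in (100_000, 400_000):
    print(decay_bp, markers_needed(approx_span_bp, decay_bp))
```

Under this approximation, slower LD decay (a longer decay distance) lowers the number of markers required, which is why groups with extended LD need fewer markers for large-effect detection.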

Based on the observed genome-wide LD landscape, we speculated that the extended LD observed in chromosome 17 (Supplementary Figure 4), influencing the LD decay for most population groups except group 10 (Supplementary Figure 6b) [...]