Background

Tea is one of the most popular beverages worldwide [1, 2] and has high nutritional and medicinal value. Its rich flavor derives from nearly 700 bioactive compounds, such as catechins (a subgroup of flavan-3-ols), theanine, caffeine, and volatiles [3, 4]. Tea, Camellia sinensis (L.) O. Kuntze, Theaceae (C. sinensis), has been grown in the Yunnan-Guizhou Plateau in southwest China for approximately 5,000 years and is now widely cultivated all over the world [4]. The Guizhou Plateau is the center of origin of tea [4, 5], where the population diversity of tea is well preserved, with abundant wild tea plants, ancient landraces and modern landraces that differ in morphological characteristics. This diversity is attributed to the unique geology, diverse climates and plentiful rainfall of the region and to the cross-pollinating nature of tea plants [6]. Because economic development and land-use change have been slow in the Guizhou Plateau, large-scale elimination of tea species has not occurred there.

Ancient tea plants belong to Sect. Thea (L.) Dyer and are defined as varieties grown for more than 100 years. Wild teas, including the wild type and self-wild type, are valuable for scientific research and application because they have mainly undergone natural selection and were only minimally affected by artificial selection. Analyzing genetic diversity and population genetic structure is important for depicting the domestication event and genetic relationships of tea plants. It also helps to expedite the development of breeding strategies [7]. Molecular markers have been a powerful tool for the genetic study of tea populations; these include RAPD [8], nSSR [1, 9], gSSR [2], SSR [10, 11], SNP [12], AFLP [13], ISSR [14] and EST-SSR markers [15, 16]. As revealed by these studies, current tea populations evolved from a single species in the Yunnan-Guizhou (Yun-Gui) Plateau. However, the tea populations used in these previous studies had either a small sample size or a narrow geographic distribution, including only 14 tea-producing regions in Yunnan [17], Guangxi [18] or across China.

Linkage disequilibrium (LD) is defined as the association of alleles at different loci within a given population. Understanding the LD pattern is crucial for tea breeding [19,20,21]. Genotyping-by-sequencing (GBS) has emerged as a useful tool for linkage map construction and the extensive identification of polymorphisms [21, 23,24,25,26,27,28]. It has also been widely used in population structure and genetic diversity studies [29,30,31,32,33]. To our knowledge, the LD pattern, population structure, and genetic diversity of tea germplasm have not previously been examined using GBS. In addition, very few studies have focused on the tea population in the Guizhou Plateau [22]. Therefore, we employed the GBS approach and performed a genetic analysis on a large tea population consisting of 415 accessions, including wild varieties, ancient landraces and modern landraces in the Guizhou Plateau, as well as cultivated varieties from Zhejiang, Fujian, Hunan, and Guizhou. We aimed to (1) identify SNPs at the genome level; (2) analyze the population structure and genetic diversity; and (3) characterize the LD patterns in different varieties. Our findings will facilitate future genome-wide association mapping and marker-assisted selection of tea.

Results

Genome-wide SNPs discovery and the GBS analysis

GBS was performed on 415 tea accessions using the Illumina HiSeq X Ten platform. After the primary quality filtering step, 390.3 Gb of clean data were obtained, with an average of 0.94 Gb of clean data per accession (Additional file 1: Table S1). An average of 65% of the total reads were successfully mapped onto the tea genome (Additional file 1: Table S1). The SNPs were detected and genotyped by GATK (version 3.7.0) based on the reference genome [34]. We identified a total of 1,001,372 SNPs with a minimal set of initial quality filters. After stricter filter conditions were applied, the number of SNPs was reduced to 287,408, with an average SNP density of one per 10.5 kb and an average quality value of 41,262 (data not shown). The average individual heterozygosity was 17.84% (Additional file 1: Table S2). Furthermore, 79,016 high-quality SNPs were identified, and an average individual heterozygosity of 19.21% was observed (Additional file 1: Table S3). All 79,016 SNPs were physically mapped across all scaffolds, with an average map density of 38.24 kb and an average quality value of 41,394 (Additional file 1: Table S3). We found more transitions (62,962 loci, 79.68%) than transversions (15,650 loci, 19.81%), and the transition/transversion ratio was 4.02. C/T transitions and C/G transversions occurred at the highest and lowest frequencies, respectively. The frequencies of A/G and C/T transitions were similar (39.83 and 39.85%, respectively), and the four types of transversions also occurred at similar frequencies: 5.89% for A/T, 5.01% for A/C, 3.81% for G/C and 5.09% for G/T (Table 1).
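The transition/transversion classification reported above can be illustrated with a short Python sketch. The function names are hypothetical and are not part of the pipeline used in this study; the only values taken from the text are the two locus counts.

```python
# Purines and pyrimidines; a transition stays within one chemical class.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a bi-allelic SNP."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid SNP: {ref}/{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(n_transitions: int, n_transversions: int) -> float:
    """Ratio of transition to transversion counts."""
    return n_transitions / n_transversions


# Locus counts reported in the text above.
ratio = ts_tv_ratio(62_962, 15_650)
```

With the reported counts, the ratio rounds to 4.02, which matches the value given in the text.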

Table 1 Percentage of transition and transversion SNPs identified using genotyping-by-sequencing

Estimation of genetic diversity

The average genetic diversity (GD), observed heterozygosity (Ho) and polymorphism information content (PIC) of the 415 tea accessions were 0.257, 0.247 and 0.214, respectively (Table 2). The percentage of polymorphic loci (PPL) was significantly higher in the cultivation type than in the wild type (Table 2; Additional file 1: Table S5). PPL was significantly higher in the Pure Cultivation Type (GP03) than in the Admixed Wild Type (GP02) and Pure Wild Type (GP01) (Table 3). Among the six zones, PPL was significantly higher in Ia than in Ic, II and III (Additional file 5). GD, Ho, and PIC were significantly higher in the cultivation type than in the wild type (Table 2; Additional file 1: Table S5), and in the Pure Cultivation Type (GP03) than in the Admixed Wild Type (GP02) and Pure Wild Type (GP01). GD, Ho, and PIC were also significantly higher in Ia, Ib, Ic and II than in III and IV (Table 2; Additional file 1: Table S5; Additional file 5).

Table 2 Genetic diversity parameters of 415 tea accessions in Guizhou Plateau
Table 3 Genetic differentiation of inferred populations of tea plants in Guizhou Plateau

Population structure analysis

We used STRUCTURE and PCA to analyze the genetic structure of the tea accessions. Both analyses were performed using 1,135 LD-pruned SNPs. Based on the genetic distance matrix of the 415 tea accessions, we used TASSEL v.5.2.37 to build a UPGMA tree.

The number of clusters was first estimated in STRUCTURE based on the ΔK method [35, 36] and the plateau criterion [37]. The results showed that ΔK reached its maximum value at K = 2 (Fig. 1a). On this basis, two ancestral groups were identified (Fig. 1b). Accessions with a score higher than 0.80 were assigned to a pure group, while those with a score lower than 0.80 were assigned to the admixture group. The first pure group (referred to as the ‘Pure Wild Type’ or ‘GP01’ from now on) consisted of 52 accessions, all of which were wild type from Camellia tachangensis F.C.Zhang and most of which were from zones IV, III and II (Additional file 2). One hundred accessions (approximately 24% of the 415 accessions) exhibited an admixed ancestry. In the admixed cluster (referred to as the ‘Admixed Wild Type’ or ‘GP02’ from now on), 95% were wild type; the cluster included 45 Camellia tachangensis from Ia, 50 Camellia remotiserrata Zhang from Ia, and five uncertain species (Additional file 2). The second pure group (referred to as the ‘Pure Cultivation Type’ or ‘GP03’ from now on) consisted of 263 accessions, of which 98% were cultivated type from Camellia sinensis (including the ancient landraces and modern landraces).
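The 0.80 assignment rule described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name and the `q_wild` argument are hypothetical, and treating a score of exactly 0.80 as admixed is an assumption, because the text does not state how ties were handled.

```python
def assign_group(q_wild: float, threshold: float = 0.80) -> str:
    """Assign an accession from its K = 2 ancestry proportions.

    q_wild is the (hypothetical) membership coefficient for the wild
    ancestral cluster; the cultivated coefficient is 1 - q_wild.
    """
    if not 0.0 <= q_wild <= 1.0:
        raise ValueError("ancestry proportion must lie in [0, 1]")
    q_cultivated = 1.0 - q_wild
    if q_wild > threshold:
        return "GP01"  # Pure Wild Type
    if q_cultivated > threshold:
        return "GP03"  # Pure Cultivation Type
    return "GP02"      # Admixed Wild Type (assumed to include ties)
```

For example, an accession with `q_wild = 0.95` would be labelled GP01, and one with `q_wild = 0.50` would be labelled GP02.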

Fig. 1

The genetic clusters inferred using STRUCTURE. a Graphical method allowing the detection of the number of groups K using ∆K and LnP(K). ∆K and LnP(K) are shown in blue and red, respectively. b Inferred population structure of the collection using STRUCTURE software. Bar plot of individual ancestry proportions for the genetic clusters inferred using STRUCTURE (K = 2). Individual ancestry proportions (q values) are sorted within each cluster. Admixture model, independent frequencies, 30,000 burn-in iterations, and 100,000 Markov Chain Monte Carlo iterations were used for this analysis. Cultivation type and wild type ancestral populations are shown in red and blue, respectively

The PCA results were highly consistent with those of STRUCTURE (Fig. 2). PCA revealed two main clusters that correspond to the two ancestral groups identified using STRUCTURE. The Pure Cultivation Type cluster was more scattered than the Pure Wild Type cluster, and the Admixed Wild Type was dispersed between these two clusters along the left side of the PC2 or PC3 axis (Fig. 2). The UPGMA tree also agreed with the STRUCTURE results, although some subgroups were formed within the Pure Cultivation Type cluster (K = 2) (Fig. 3b). Furthermore, the UPGMA tree was largely concordant with the growth habit (wild type and cultivation type) (Fig. 3a), the cultivation status (modern landraces, ancient landraces and wild tea trees) (Fig. 3c) and the classification (C. tachangensis, C. sinensis and C. remotiserrata) (Fig. 3d) of the tea accessions.

Fig. 2

Principal component analysis (PCA) of 415 tea accessions. PCA using 1135 selected SNPs with no linkage disequilibrium in the set of 415 tea accessions. GP03 identified in STRUCTURE is shown in green, GP01 in red and GP02 in blue. First and second components (a) and first and third components (b) of the PCA analyses are shown

Fig. 3

Cluster analysis based on genetic distance using a UPGMA tree. a UPGMA cluster tree compared with growth habit: wild type (red) and cultivation type (green). b UPGMA cluster tree compared with STRUCTURE results (K = 2): Pure Wild Type (red), Pure Cultivation Type (green) and Admixed Wild Type (yellow). c UPGMA cluster tree compared with cultivation status: modern cultivation (red), ancient cultivation (green) and wild (yellow). d UPGMA cluster tree compared with classification results: C. tachangensis (red), C. sinensis (green), C. remotiserrata (yellow) and uncertain species (blue). e UPGMA cluster tree including the four inferred groups: GP01 (red), GP02 (yellow), GP03–1 (green) and GP03–2 (purple)

The plateau criterion was also used to estimate the number of clusters [37,38,39,40]. As shown in Fig. 1, the mean log-likelihood (LnP(K)) curve reached a plateau at around K = 3 to 4 [20]. Therefore, we further analyzed the 263 accessions of the GP03 ancestral group to explore whether subgroups could be identified using STRUCTURE, as reported by Campoy et al. [20]. The 52 accessions in the GP01 ancestral cluster and the 100 accessions in the GP02 cluster were excluded from further analyses (Additional file 2). Within the GP03 group of 263 accessions, we identified two subgroups at K = 2 (Additional file 3: Figure S1 and S2) based on Evanno’s ΔK (accessions were assigned to the two groups using an estimated score of 0.5). The first subgroup included 213 Pure Cultivation Type accessions, of which 78% were ancient landraces (referred to as the ‘ancient landraces’ or ‘GP03–1’ hereafter). The second subgroup was smaller, containing only 50 Pure Cultivation Type accessions, of which 92% were modern landraces (referred to as ‘modern landraces’ or ‘GP03–2’ hereafter) and 8% were breeding varieties (Additional file 2). Overall, the 415 accessions were clustered into three groups, including two main groups (GP01 and GP03) and an admixed group (GP02), and the GP03 group could be further divided into two subgroups (GP03–1 and GP03–2). The result was confirmed by both the UPGMA tree (Fig. 3e) and PCA (Fig. 4) (Additional file 3: Figure S3).

Fig. 4

Principal component analysis (PCA) of 415 tea accessions. PCA using 1135 selected SNPs with no linkage disequilibrium in the set of 415 tea accessions. The GP01 cluster identified in STRUCTURE is shown in red, the GP02 cluster in blue, GP03–1 in purple and GP03–2 in green. First and second components (a) and first and third components (b) of the PCA analyses are shown

LD analysis

In this study, the extent of LD was evaluated in the 415 tea accessions for all scaffolds longer than 500 kb using 143,041 non-LD-pruned SNPs (Fig. 5a). LD declined rapidly with increasing physical distance. The studied population had an overall low LD, and most r2 values were below 0.16 (Fig. 5a). On average, r2 fell below 0.08 within approximately 2 kb (Fig. 5b).

Fig. 5

Linkage disequilibrium decay for all scaffolds longer than 500 kb. a Scatter plot of LD decay (r2) against the genetic distance for pairs of linked SNPs across all scaffolds longer than 500 kb. b Zoom-in scatter plot of LD decay (r2) against the genetic distance

LD decay in the four inferred groups was estimated (Additional file 4: Figure S1). The slowest LD decay was observed in GP01, where r2 reached 0.08 (the threshold) at approximately 35 kb. Conversely, LD declined most rapidly in GP02, where r2 = 0.08 corresponded to a physical distance of approximately 1 kb, followed by subgroup GP03–1, in which r2 = 0.08 corresponded to approximately 2 kb. The LD of subgroup GP03–2 declined below r2 = 0.08 at approximately 25 kb.
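The idea of reading off the distance at which r2 drops below the 0.08 threshold can be sketched in Python. Note that this sketch averages r2 within fixed-width distance bins as a simple stand-in for the loess smoothing used in this study; the function name, the bin width and the input layout are all assumptions made for illustration.

```python
from collections import defaultdict


def decay_distance(pairs, threshold=0.08, bin_size=1_000):
    """Return the start (bp) of the first distance bin whose mean r2 is
    below `threshold`, or None if no bin qualifies.

    pairs: iterable of (distance_bp, r2) tuples, one per SNP pair.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, r2 in pairs:
        b = int(distance // bin_size)
        sums[b] += r2
        counts[b] += 1
    # Walk the bins from short to long distances.
    for b in sorted(sums):
        if sums[b] / counts[b] < threshold:
            return b * bin_size
    return None
```

Applied separately to the SNP pairs of each inferred group, a function of this kind would give one decay distance per group, which is how the group-wise values above can be compared.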

Genetic differentiation analysis

Genetic variation was calculated for the four inferred groups (Table 3). The percentage of polymorphic loci (PPL) was significantly lower in GP01 than in GP02, GP03–1 and GP03–2 (Table 3). We detected no significant differences in PPL among GP02, GP03–1, and GP03–2. The genetic variation in GP02 and GP03–1 was significantly higher than in GP01 and GP03–2, with GP01 showing the lowest genetic variation (Table 3). Fis in all four inferred populations was significantly different from zero (Table 3): Fis in GP02, GP03–1 and GP03–2 was significantly lower than zero, and Fis in GP01 was significantly higher than zero.

The pairwise Fst values ranged from 0.054 to 0.178 with a mean value of 0.101 (Table 4). The lowest level of differentiation was observed between GP03–1 and GP03–2, whereas GP01 and GP03–2 differentiated the most. An intermediate differentiation was observed between GP01 and GP03–1 (Table 4). The Fst results were confirmed by the pairwise genetic distance calculated in the R package adegenet (Table 4).

Table 4 Fst and pairwise genetic distance among four inferred populations of tea plant in Guizhou Plateau

Discussion

Estimation of genetic diversity

In this study, we report the first genetic diversity analysis of a tea population using GBS, a simple and cost-effective approach [41,42,43,44]. We generated 390.30 Gb of clean reads and identified 79,016 high-quality SNPs using stringent filtering criteria. The number of SNPs identified in the present study was higher than the numbers used in previous studies [38, 39, 45, 46], suggesting that the GBS approach is powerful for genetic diversity analyses of tea species.

Previous studies have shown that breeding practices have a greater effect on reducing genetic diversity than domestication, leading to a lower level of genetic diversity in cultivated germplasm compared with wild varieties [7]. Interestingly, our genetic diversity analysis of the Guizhou Plateau tea varieties shows the opposite: we observed a significantly higher level of genetic diversity in the cultivation type than in the wild type, which differs from the findings of previous studies [40, 41]. A plausible explanation for this counterintuitive finding is the presence of ancient landraces in the cultivation type. The ancient landraces were derived from early landraces and their natural offspring. They grow on the edges of terraced fields to prevent soil erosion or are used as fences to separate fields owned by different farmers; such human activities were not for breeding purposes. The cross-pollination characteristics of tea species have also contributed to the large genetic variation in the cultivation type. The relatively isolated natural environment of the Guizhou Plateau may have reduced the genetic perturbation of the wild type group by other tea varieties. Consistent with our hypothesis, a narrow genetic diversity of tea cultivars has been reported in tea-producing regions worldwide where several clonal tea cultivars dominate the local populations [32, 33]. This narrow diversity not only imposes limitations on tea breeding but also increases the risk from natural hazards, because wild tea plants and landraces provide valuable genetic resources for tea breeding [40]. This is especially relevant for the Guizhou Plateau, which has many ancient landraces and Pure Wild Type accessions, both of which can be used for tea breeding. Therefore, future studies should focus more on the tea germplasm in the Guizhou Plateau.

Population structure

In this study, we used three different approaches (STRUCTURE, PCA, and UPGMA) to analyze the population structure of the 415 tea varieties, and the results we obtained complemented the previous studies. STRUCTURE could effectively identify global clusters, which were subsequently validated by PCA. However, the two parameters we used to determine the number of clusters in STRUCTURE yielded different K values: Evanno’s ΔK method identified K = 2 when analyzing the entire germplasm collection and the cryptic structure. Evanno’s method focuses exclusively on the change in slope; it therefore estimates the uppermost level of structure in the data, which may cause ΔK to be artificially maximal at K = 2 in some cases, as reported previously by Campoy et al. [20]. We used the maximum likelihood parameter in our analyses as recommended by Pritchard [37], in which K was set to three. K = 3 appeared to fit the origin and the pedigree of the accessions in the Guizhou Plateau. Therefore, the 263 accessions in GP03 obtained with STRUCTURE at K = 2 were further analyzed. The clustering of the tea accessions correlated well with cultivation status and origin at K = 2, as revealed by Evanno’s ΔK method: the 415 accessions were clustered into four populations, including two main populations (GP01 and GP02) and two subgroups (GP03–1 and GP03–2). All accessions in GP01, the Wild Type group, were C. tachangensis; the Admixed Wild Type group GP02 contained C. tachangensis and C. remotiserrata varieties; GP03–1 represented ancient landraces, all of which are C. sinensis; and GP03–2 consisted of cultivated varieties including modern landraces and breeding varieties, most of which are C. sinensis.

We detected the lowest genetic differentiation and genetic distance between the modern and ancient landraces. The Pure Wild Type and modern landraces exhibited the largest genetic differentiation and genetic distance, followed by that between the Pure Wild Type and ancient landraces, and that between the Admixed Wild Type and ancient landraces. These results support the notion that the evolution of tea plants was related to the historical tea cultivation in the Guizhou Plateau. The Pure Wild Type is the most primitive resource that originated in the region, and the retained species purity was owing to the isolated ecological environment. The ancient landraces and the Admixed Wild Type likely emerged in the Ming Dynasty, when local landraces, introduced landraces, and wild species were co-cultivated. The co-cultivation facilitated cross-pollination among different germplasms, which reduced the genetic distance and differentiation between the ancient landraces and the Admixed Wild Type and significantly increased the diversity of the ancient landraces and the Admixed Wild Type among all inferred groups. Most modern landraces and breeding varieties were assigned to GP03–2, reflecting a narrowed genetic basis of the modern landraces due to breeding practice.

We observed the lowest genetic differentiation between GP03–1 and GP03–2, suggesting that human activities may have caused frequent gene exchange between these two subgroups. GP01 and GP03–2 showed the highest level of genetic differentiation and distance, implying that geographic isolation has restricted the gene flow among populations. This observation could also be a result of reproductive isolation between species. According to our results, GP03–1 and GP02 exhibited a higher genetic diversity than GP01 and GP03–2; therefore, varieties in GP03–1 and GP02 can be used for tea improvement. As revealed by our data, the differences between species did not affect clustering, which reflects the complexity and uncertainty of the tea classification systems. Thus, it is necessary to establish a more scientific classification system. In addition, natural hybridization between tea species may be another explanation for the results mentioned above (Additional file 1: Table S6; Additional file 1: Table S7).

Linkage disequilibrium

LD decays more rapidly among cross-pollinated species such as tea plants than among self-pollinated species because of the less effective recombination in the latter [49, 50]. We observed a rapid LD decay in the 415 accessions: LD declined below r2 = 0.08 at approximately 2 kb, a shorter distance than that observed in Prunus [20] and melon [21]. This may be due to the self-incompatibility of the tea plant [48]. The rapid LD decay and the high proportion of SNPs in LD suggest that GWAS can be used to inform the breeding of the tea varieties in the Guizhou Plateau. These findings are not consistent with those of Jin et al. [5], which may be caused by the differences in the genetic backgrounds among different varieties within each species. In cross-pollinated species, LD can be affected by extreme genetic drift in domestication and breeding during evolution [20]. Thus, we investigated LD decay among the subgroups to provide valuable genetic information for future studies [21]. Subgroups GP01 and GP03–2 displayed a much slower LD decay than GP02 and GP03–1, which is likely because modern landraces had experienced artificial selection pressure and the Pure Wild Type had experienced extreme genetic drift, leading to the fixation of a higher number of LD blocks. The slow LD decay in the Admixed Wild Type group and ancient landraces facilitates the identification of markers associated with desirable traits, as a relatively small number of markers could cover the entire genome. The Admixed Wild Type group and ancient landraces are ideal populations that can be directly used for breeding, while varieties from the Pure Wild Type group can be crossed with modern landraces to achieve heterosis, because the genetic distance between these two groups is the greatest among all groups.

Conclusions

Genome-wide SNPs in various tea varieties from the Origin Center, Guizhou Plateau, were identified in this study using GBS. These SNPs were used to analyze the genetic diversity, population structure, and LD pattern of the 415 tea accessions. Our results showed that the 415 accessions could be clustered into four populations, including two main populations (GP01 and GP02) and two subpopulations (GP03–1 and GP03–2). The ancient landrace group was found to have a more complex genetic structure than the wild and modern landraces. These data will inform the collection, conservation, and application of the tea varieties in the Guizhou Plateau.

Materials and methods

Plant materials

A total of 415 samples, including 159 wild varieties and 256 cultivated varieties (174 ancient landraces, 77 modern landraces and five breeding varieties), were included in this study (Additional file 5; Additional file 2). According to the classification systems reported by Chen et al. [52] and Min [53], 251 Camellia sinensis (L.) O. Ktze, 100 Camellia tachangensis (F.C.Zhang), 59 Camellia remotiserrata (Zhang) and five near Camellia taliensis (W.W.Smith) were identified (Additional file 2). Hereafter, samples from wild tea trees that are more than 100 years old and their natural offspring are referred to as “wild type”; samples from cultivated tea varieties that are more than 100 years old are referred to as “ancient landraces”; and samples from garden tea landraces are referred to as “modern landraces” (Additional file 2). The “ancient landraces”, “modern landraces” and “breeding varieties”, which had undergone artificial selection, are all referred to as “cultivation type”.

We collected the samples from tea-growing areas with different climates (Additional file 5). Specifically, a total of 276 samples were collected from tea varieties growing in areas of Guizhou with very suitable climates, comprising 168, 51 and 57 accessions from northern (Ia), eastern (Ib) and southern Guizhou (Ic), respectively. Eighty-three samples were harvested from central Guizhou, where the climate is suitable for tea growth (II). Forty-one samples were collected from areas in western Guizhou with a less suitable climate (III), and 10 samples were from areas in western Guizhou with an unsuitable climate (IV). One variety was collected from Guizhou. Four varieties were collected from other provinces: two from Fujian, one from Zhejiang, and one from Hunan (Additional file 5; Additional file 2) [35]. The samples were planted in the city of Guiyang, China. Fresh leaves harvested from each accession were snap-frozen in liquid nitrogen and stored at −80 °C until use.

DNA extraction

We used the Plant Genomic DNA Rapid Extraction kit (Biomed Gene Technology) to isolate genomic DNA from the samples. DNA integrity was tested on a 1% agarose gel, and DNA purity was tested and quantified using a Qubit Fluorometer (Invitrogen).

Library preparation and sequencing

We used 5 U of SacI and MseI (NEB) and 1 × restriction buffer in a 25 μl reaction to digest 100 ng of genomic DNA. After digestion, SacAD and MseAD adaptors were ligated to the digested DNA fragments; 12 samples were pooled in equal volumes and purified using the QIAquick PCR Purification Kit (Qiagen) [47]. We then used the PCR Primer Cocktail and PCR Master Mix to amplify the purified DNA fragments. Amplicons of 500–550 bp (including the 120 bp adaptor) were retrieved through electrophoresis on a 2% agarose gel and purified using the QIAquick Gel Extraction Kit (Qiagen) [47]. The Agilent DNA 12000 kit and 2100 Bioanalyzer system (Agilent) were used to determine the average length of the DNA fragments, and the resulting DNA libraries were quantified using real-time PCR with a TaqMan probe and sequenced on the Illumina HiSeq X Ten platform with the paired-end 150 (PE150) sequencing strategy. Each library contained 48 samples, and we matched the clean reads individually to the barcodes and remnant restriction sites at both ends [47].

Sequence alignment and SNP identification

The barcodes were used to de-multiplex the raw DNA reads, and a custom Perl script was used to trim the adaptors. Only the reads with quality values > 5 were retained as clean data and then aligned to the reference genome (http://www.plantkingdomgdb.com/tea_tree/) [3] using BWA-MEM (version 0.7.10) with the parameters ‘-T 20 -k 30’ [54]. GATK (version 3.7.0) was used to call SNPs.

The SNPs were filtered according to the methods used by Hussain et al. [23], Chen et al. [19] and Eltaher et al. [28] based on the following criteria: (1) variants must be bi-allelic SNPs; (2) the expression “QUAL < 50.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0” was used in variant filtration in GATK (version 3.7.0) to filter the SNPs; (3) SNPs with a minor allele frequency (MAF) lower than 0.05 or a missing data rate higher than 20% were filtered out by VCFtools (version 0.1.15); (4) the SNPs were pruned with a window of 50 SNPs, a step size of 10 SNPs, and an r2 threshold of 0.2 by PLINK (v1.9). After the filtering, 415 accessions and 79,016 SNPs were retained and used for further analysis.
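Criterion (3) can be illustrated with a small Python sketch. It does not reproduce VCFtools itself; the function name and the genotype coding (number of alternate alleles, with `None` for a missing call) are assumptions chosen for the example.

```python
def passes_filters(genotypes, min_maf=0.05, max_missing=0.20):
    """Check one bi-allelic SNP against MAF and missing-rate cut-offs.

    genotypes: list with one entry per accession, coded as the number of
    alternate alleles (0, 1, 2) or None for a missing call.
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_total == 0 or not called:
        return False
    missing_rate = 1 - len(called) / n_total
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)  # frequency of the rarer allele
    # Keep the SNP only if it is common enough and well genotyped.
    return maf >= min_maf and missing_rate <= max_missing
```

A SNP with genotypes `[0, 1, 2, 1, 0]` passes, whereas one with a single heterozygote among 20 accessions (MAF of 0.025) would be removed.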

Analysis of genetic diversity

The polymorphism information content (PIC) values for the SNP data were calculated using the following equation [19].

$$ \mathrm{PIC}=1-\sum \limits_{i=1}^n{P}_i^2-\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^n2{P}_i^2{P}_j^2 $$
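As a worked illustration of the equation above, the following Python sketch evaluates PIC for a single locus. It assumes that P_i denotes the frequency of allele i and n the number of alleles at the locus, which is the usual reading of this formula but is not stated explicitly in the text; the function name is hypothetical.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one locus.

    allele_freqs: frequencies P_1..P_n of the n alleles (summing to 1).
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # First sum: probability that two random alleles are identical.
    homozygosity = sum(p * p for p in allele_freqs)
    # Double sum over all unordered allele pairs i < j.
    cross_term = sum(2 * (pi ** 2) * (pj ** 2)
                     for pi, pj in combinations(allele_freqs, 2))
    return 1 - homozygosity - cross_term
```

For a bi-allelic SNP with both alleles at frequency 0.5, the terms are 1 − 0.5 − 0.125, so PIC takes its maximum value of 0.375; a monomorphic locus gives 0.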

The mean number of observed alleles per locus and the observed heterozygosity (Ho) were calculated for each group using TASSEL v.5.2.37 [55]. Genetic diversity and inbreeding were calculated for each group using PowerMarker v3.25. Fst was calculated for each group using VCFtools [56].
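The per-locus quantities behind GD, Ho and the inbreeding coefficient can be sketched as follows. This is a simplified illustration using the textbook definitions (gene diversity as expected heterozygosity, and Fis = 1 − Ho/He); the exact estimators implemented in TASSEL and PowerMarker, including any sample-size corrections, may differ, and the function name and genotype coding are assumptions.

```python
def diversity_stats(genotypes):
    """Return (gene_diversity, observed_heterozygosity, fis) for one SNP.

    genotypes: alternate-allele counts (0, 1, 2); None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    n = len(called)
    p_alt = sum(called) / (2 * n)
    # Gene diversity = expected heterozygosity under random mating.
    gene_diversity = 1 - (p_alt ** 2 + (1 - p_alt) ** 2)
    ho = sum(1 for g in called if g == 1) / n
    fis = 1 - ho / gene_diversity if gene_diversity > 0 else float("nan")
    return gene_diversity, ho, fis
```

Under these definitions, an excess of heterozygotes relative to expectation gives a negative Fis and a deficit gives a positive Fis.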

Linkage disequilibrium

Prior to the PCA and STRUCTURE analyses, we LD-pruned the SNPs again using PLINK (v1.9) [51] with a window of 50 SNPs and a step size of five markers. The r2 threshold was 0.4. PLINK was used to measure pairwise LD between multiple SNPs [20, 54]. The pairwise LD between 143,041 genome-wide unpruned SNPs from sequences longer than 500 kb was calculated based on the allele frequency correlations (r2) using the PopLDdecay program. To summarize the relationship between LD decay and physical distance, we fitted a locally weighted linear regression (loess) model to the r2 data [20, 57] using the R function ‘loess’ (http://www.R-project.org/) [58], with r2 summarizing both the recombinational and mutational history [59]. The LD decay plot was drawn using R.
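For readers unfamiliar with r2, the sketch below computes it as the squared Pearson correlation between the genotype dosages of two SNPs. This genotype-based version is an assumption made for illustration; PLINK and PopLDdecay may instead estimate r2 from inferred haplotype frequencies, so values can differ slightly. The function name and genotype coding are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two SNPs' genotype dosages.

    x, y: equal-length lists of alternate-allele counts (0, 1, 2), with
    None for missing calls; accessions missing at either SNP are skipped.
    """
    paired = [(a, b) for a, b in zip(x, y)
              if a is not None and b is not None]
    n = len(paired)
    if n < 2:
        raise ValueError("need at least two jointly called accessions")
    mean_x = sum(a for a, _ in paired) / n
    mean_y = sum(b for _, b in paired) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in paired)
    var_x = sum((a - mean_x) ** 2 for a, _ in paired)
    var_y = sum((b - mean_y) ** 2 for _, b in paired)
    if var_x == 0 or var_y == 0:
        raise ValueError("monomorphic SNP: r2 is undefined")
    return cov * cov / (var_x * var_y)
```

Two SNPs with identical genotypes across accessions give r2 = 1, and two SNPs whose genotypes vary independently give a value near 0.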

Population structure

Population structure was analyzed using the model-based Bayesian analysis implemented in STRUCTURE [37]. The number of subpopulations (K) was determined using the mean likelihood values in the ΔK method and the lnP(K) values [36, 59] calculated by Structure Harvester [60]. We estimated the variance between replicates by continuously running K = 1–9 to determine the optimal population number [19]. The analysis was conducted with a burn-in of 30,000 iterations followed by 100,000 Markov Chain Monte Carlo (MCMC) replications in three independent runs. No prior information was used to define the clusters. We enforced K to its true value to assess the clustering results. For each given K value, the run with the highest likelihood was used to cluster the accessions. We set the threshold value at 0.8 to distinguish between the pure and mixed groups. PCA was performed using TASSEL v.5.2.37 [55]. The genetic distance among different individuals was used for PCA and for constructing the UPGMA tree. The UPGMA tree was generated using a simple matching coefficient in TASSEL v.5.2.37 [37]. Fst and pairwise genetic distance among the four inferred groups were calculated in the R package adegenet v.2.1.1 [61].
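The ΔK statistic used to choose K can be sketched in Python. The version below computes the absolute second-order difference of the mean lnP(K) across replicate runs and divides it by the standard deviation of lnP(K) at that K; this follows the general form of Evanno's method, but it is an illustrative sketch with a hypothetical function name and input layout, not the Structure Harvester code.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Evanno's delta-K from replicate STRUCTURE runs.

    lnp_by_k: dict mapping K -> list of lnP(K) values, one per replicate
    (at least two replicates per K).
    Returns a dict K -> delta-K for every K with both neighbours present.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # delta-K is undefined at the smallest and largest K
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result
```

Because ΔK cannot be evaluated at the smallest K tested, it can never select K = 1, which is one reason it is often used together with the lnP(K) plateau criterion, as was done here.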