Background

In crops, the level of genetic diversity and linkage disequilibrium (LD) can be affected by various factors including demography and inbreeding [16], selection for favorable alleles [7, 8], domestication [2, 9, 10], outcrossing of crop cultivars with genetically distinct lines of wild ancestors and landraces [1, 11, 12] and admixture [13, 14]. Genetic diversity of domesticated crops is usually reduced compared to wild ancestors [2, 6, 9, 15, 16]. In tetraploid wheat, the population bottleneck that accompanied tetraploid emmer wheat domestication about 10,000 years ago [17] reduced nucleotide diversity by 30 to 50% in the A- and B-genomes, depending on the study and diversity measure used [15, 18]. Diversity was further reduced in hexaploid wheat as a consequence of the polyploidy bottleneck resulting from hexaploid wheat speciation [18, 19]. Different rates of gene flow from the ancestors of hexaploid wheat, tetraploid wheat for the A- and B-genomes and Aegilops tauschii for the D-genome [20, 21] resulted in different levels of diversity in hexaploid wheat genomes [19]. While diversity levels are similar in the A- and B-genomes, it is greatly reduced in the D-genome [18, 19]. The D-genome also shows higher levels of LD than the A- and B-genomes [19, 22].

Interpreting patterns of genetic diversity in modern crop cultivars is further complicated by strong selection and interbreeding with landraces and genetically distant wild relatives. Regions of the genome subjected to recent selection or introgressions from landraces or wild relatives were shown to have elevated LD and low genetic diversity [7, 12, 23]. Factors such as inbreeding, human- and environment-driven selection, founder effect and gene flow influence the distribution of genetic variation both across the genome and between populations resulting in the formation of genetically differentiated groups [26]. Strong population structure has been reported for many crops including wheat [5, 22, 24, 25]. A thorough understanding of population structure has important implications as population structure is one of the major reasons for false associations between phenotypes and markers in association mapping (AM) studies [26]. Hence, the inclusion of population structure estimates in AM is important for reducing spurious associations [27].

On the practical level, the distribution of genetic diversity in modern cultivars plays an important role in the choice of specific mapping and crop improvement strategies. In recent years, association mapping was shown to be a powerful method for complementing the traditional gene mapping studies based on controlled crosses [5, 2830]. The extent of LD defines the marker density required for genome-wide association mapping (GWAM). GWAM experiments in human and natural plant populations require several hundred thousand SNPs for finding a marker allele linked to a causal mutation [4, 31]. However, the elevated level of LD in crop populations suggests that a smaller number of markers can provide sufficient genome coverage for finding marker-trait associations. Indeed, in two-row spring barley cultivars significant intra-chromosomal LD extended up to 15 cM [3]. Analysis of LD patterns in U.S. wheat populations showed significant LD extended to 5 cM [25] or 10 cM [22] while some populations of durum wheat (T. turgidum) retained more than 50% of their initial LD value at distances up to 20 cM [24]. In theory, by selecting a set of closely related cultivars it should be possible to increase the extent of LD and use fewer markers for detecting associations.

Recent advances in DNA sequencing and genotyping have enabled genome-wide studies capable of characterizing genetic variation and the extent of LD in natural and breeding populations. Single nucleotide polymorphism (SNP) has become the most frequently used type of molecular marker for these analyses in many species because of their high abundance across the genome and the availability of cost-effective high-throughput genotyping assays [3234]. One of the first sets of SNPs developed for polyploid wheat [19] was used in this study to design a 1536-plex wheat oligo pool assay (wheat OPA) to analyze the patterns of SNP variation and LD in diverse populations of cultivated spring and winter wheat lines from the US and CIMMYT ("Centro Internacional de Mejoramiento de Maíz y Trigo") breeding programs. This knowledge is critical to the design of valid GWAM experiments in wheat and useful for understanding the role of selection and breeding in the distribution of genetic variation across the wheat genome.

Methods

Plant material

The wheat lines included in our study represent diverse cultivars utilized in 17 wheat breeding programs including 9 winter and 8 spring wheat populations. All cultivars were selected to represent the current genetic and phenotypic diversity of a specific state's breeding program. The phenotypic traits targeted by breeding programs include: disease resistance (leaf, stripe and stem rusts), winter survival for winter wheat breeding programs in the Midwest, end-use quality, terminal heat tolerance, resistance to drought stress, yield potential, early maturity, resistance to the wheat stem sawfly and herbicide tolerance. The complete list of lines with their pedigrees is provided in the Additional File 1.xls. Plants were grown in a greenhouse and DNA was extracted from the leaves of 4-6 week old seedlings using methods described before [35].

SNP genotyping

The SNPs discovered in a panel of 32 lines of tetraploid and hexaploid wheat were downloaded from the Wheat SNP Database [36]. SNP selection and assay design were performed according to previously described procedures [32]. The following criteria were applied for SNP selection: no more than 2 SNPs were selected per locus, with preference being given to SNPs present in at least two lines in the discovery panel. Additional SNPs were discovered by sequencing the transcriptomes of T. aestivum cv. Chinese Spring and Jagger. Repetitive elements were detected and masked by comparing sequences with the TREP [37] and GIRI [38] databases. The masked sequences were submitted to Illumina for processing by Illumina® Assay Design Tool (ADT). The ADT generates designability rank scores for each SNP that can vary from 0 to 1. The SNPs with scores above 0.6 have a high probability of being converted into a successful genotyping assay. A total of 1536 SNPs were selected for developing the wheat OPA (Additional File 2.xls). Genotyping was performed at the USDA-ARS genotyping laboratory in Fargo, North Dakota according to standard Illumina GoldenGate assay protocols [39]. Subsequent genotype calling was carried out using Illumina's BeadStudio software v.3. The accuracy of the genotype call was manually evaluated for the misclassification of homozygous and heterozygous clusters using the software's clustering algorithm. This step proved critical for reducing the genotyping error rate associated with peculiarities of clustering patterns in polyploid wheat. Following the removal of loci with low-quality clustering, the previously estimated genotyping error rate for hexaploid wheat was a mere 1% [32].

Genetic diversity

Genetic diversity was evaluated by calculating the polymorphism information content ( P I C = 1 i n p i 2 , where pi is the frequency of the i-th allele [40]) for the number of alleles across and within breeding programs using PowerMarker software [41]. Analyses were performed separately on four datasets: three datasets included SNPs grouped by genome and one dataset included complete set of SNPs.

Population structure

For analysis of population structure, the SNP dataset was divided into the three genome-specific datasets and one combined dataset. To reduce the effect of frequency correlation between linked alleles, we selected SNP loci located approximately 4 cM or farther apart from each other. The A-genome dataset included 91 SNP loci while the B-genome and D-genome dataset included 89 and 39 SNP loci, respectively (Additional File 3.xls). We assumed that each individual in the population was homozygous for all loci, and heterozygous loci were treated as missing data. The proportion of heterozygous loci in our dataset was 0.5%. The population structure was inferred using the model-based Bayesian clustering approach implemented in the program Structure [42]. A total of 10 iterations of Gibbs sampler were run for an admixture model with both correlated and non-correlated allele frequencies [43]. Lengths of burn-in and simulation runs of 105 and 106, respectively, were selected based on the convergence of summary statistics (log probability of data) among several independent runs. Results of independent runs for the same value of K were summarized using the CLUMPP program [44].

The number of populations (K) present in the dataset was estimated by plotting the probability of data ln Pr(X|K) for each value of K. The variation of ln Pr (X|K) among independent simulation runs with the same value of K and the rate of ln Pr (X|K) change from K-1 to K was used to select the optimal number of populations in the sample (Additional File 4.ppt). Two ad hoc methods for estimating the correct number of K in the sample, suggested by Pritchard et al. [42] and Evanno et al. [45], were also tested. Population assignments for each individual were visualized using Distruct software [46]. Similar analyses were performed using the program InStruct [47]. This program extends the Bayesian clustering algorithm implemented in the program Structure [42] by removing the assumption of Hardy-Weinberg disequilibrium within inferred clusters and relying instead on selfing rates to calculate expected genotype frequencies [47]. The simulations were run for 100,000 steps after 50,000 burn-in iterations.

Population structure was also analyzed using the principal component analysis (PCA) implemented in the software EIGENSTRAT [48]. Data from 597 polymorphic SNPs with minor allele frequency >0.05 was used to assess the clustering of genetic variation among all 478 samples investigated.

The differentiation of populations was further investigated by estimating FST for individual loci and the components of variance for two levels of population hierarchy using methods described by Weir and Cockerham [49] and Weir [50] as implemented in the software package PowerMarker [39]. FST provides a measure of population differentiation by estimating the correlation of alleles within the same sub-population relative to that found in the entire population. The overall distribution of genetic variation in the wheat cultivars was estimated for two levels of population hierarchy: growth habit (spring and winter wheat) and origin (breeding populations). The winter wheat population from Kansas was excluded from the analysis due to the insufficient number of lines. The mean FST values in a sliding window of 5 consecutive linked SNPs were calculated to identify genomic regions genetically differentiated between spring and winter wheat lines. A 95% confidence interval (CI) for mean FST values was estimated by sampling 1,000 times with replacement the sets of 5 SNPs randomly selected from the 849-SNP dataset and taking the 95th percentile of the distribution of means.

Regions of the wheat genome showing elevated FST levels were compared with the positions of previously mapped or cloned flowering time QTL. The sequences of genes containing SNPs included in the wheat OPA were compared with the sequences of gene-derived flanking molecular markers (ESTs, cDNA) used in QTL mapping studies. The syntenic relationship between wheat, rice and Brachypodium genomes was used to compare and validate map positions.

Linkage disequilibrium

For measuring LD, the locations of gene loci harboring SNPs on the Ae. tauschii genetic map reported by Luo et al. [51] were used. Pair-wise linkage disequilibrium (LD) was measured using the squared allele-frequency correlations, r2 , according to Weir [50]. In order to reduce the variation of LD estimates generated by the inclusion of rare alleles, only SNP alleles with minor allele frequency (MAF) higher than 0.05 were used in these calculations. LD levels and the rate of LD decay were assessed by calculating r2 for pairs of SNP loci and plotting them against genetic distance. The relationship between LD decay and genetic distance was summarized by fitting a locally-weighted linear regression (loess) line to r2 data. The statistical significance of individual r2 estimates was calculated by the exact test following Weir [50]. The false discovery rate (FDR) was established at 0.01 using the Benjamini & Hochberg method [52]. Chromosome specific r2 values were plotted using the R package LDheatmap [53]. Blocks of SNPs showing elevated levels of LD were identified using the method described by Gabriel et al. [54] and implemented in the program Haploview [55]. Background LD was estimated as the 95th-percentile of the distribution of r2 values for unlinked SNP loci [25].

Results

SNP genotyping and variation

The genotyping of 478 spring and winter wheat lines with multiplexed 1,536 Illumina Golden Gate SNP assay generated 734,208 genotyping data points (Table 1). After the removal of SNPs failing to generate clear genotype clustering, 1,299 SNPs with high quality genotype calls were obtained with a 85% success of SNP conversion into the working genotyping assays. Considering these SNPS, 849 were polymorphic among the 478 lines included in this study. Most genotypes were homozygous (400,328 = 98.6%) with only a small fraction showing residual heterozygocity (1,961 = 0.5%) or no amplification (3,533 = 0.9%). Eighty-three percent of the SNPs were polymorphic in both spring and winter wheat populations. Among the 849 polymorphic SNPs, only 52 and 97 SNPs were monomorphic in the panels of 241 spring and 237 winter wheat lines, respectively (Table 2). A high proportion of polymorphic SNPs (70% - 85%) was recovered within populations of different origin, suggesting a high level of diversity within all U.S. breeding populations (Table 2). After exclusion of two winter wheat populations (KS and NY) due to their relatively small sample sizes (4 and 10, respectively), the estimates of polymorphism level (PIC) varied within a very narrow range, between 0.14 to 0.16 among the winter populations and from 0.14 to 0.18 among the spring wheat populations.

Table 1 Wheat OPA evaluation
Table 2 Average estimates of minor allele frequency, number of alleles per locus, and polymorphism information content.

Significant differences were detected between the distribution of MAF classes in spring and winter wheat (χ2 = 50, P = 3.5 × 10-10, Figure 1). The spring varieties showed increased proportion of medium frequency alleles with MAF > 0.3.

Figure 1
figure 1

Distribution of minor allele frequency classes in the populations of spring and winter wheat.

Population structure of winter and spring wheat

For the analysis of population structure, the SNP genotyping data was organized into four datasets: three genome-specific datasets for the A- (91 loosely linked SNPs), B- (89 loosely linked SNPs), and D- (39 loosely linked SNPs) genomes and a combined set of 219 SNPs covering the entire wheat genome (see Methods and the Additional File 3.xls). The comparison of admixture models assuming independent or correlated allele frequencies suggest that a model assuming independent allele frequencies is more appropriate than the model with correlated allele frequencies for inferring the number of genetically homogeneous clusters in our dataset. The appropriate choice of the model is strongly influenced by the evolutionary history of populations. The breeding populations include diverse lines subjected to different selection regimes that can result in differentiated allelic frequencies in populations adapted to varying environmental conditions. Falush et al. [43] also pointed out that the correlated allele frequency model may overestimate K when the allele frequencies between populations are different. Since the models assuming independent and correlated allele frequencies produced similar population subdivision and classification of cultivars, hereafter, we describe only results from the independent model unless otherwise noted.

First, we tested whether structure analysis assuming K = 2 would assign winter and spring wheat cultivars into two separate clusters. Both frequency models based on the A- and B-genome SNP data produced similar population subdivision. The inferred population structure was consistent across multiple simulation runs. The majority of winter and spring wheat breeding populations were assigned to separate clusters (Additional File 5.xls). Only the A-genome data in the NY winter wheat population showed an equal proportion of ancestry in the two clusters (Additional File 5.xls). Grouping of varieties at K = 2 using the D-genome data did not result in clear separation of spring and winter lines. Only 7 out of 17 breeding populations derived more than 80% of their D-genome's genetic ancestry from only one of the two clusters.

To identify the optimal number of K clusters in genome-specific datasets, we calculated the posterior probability Pr (K|X) [42] and ΔK [45] for each simulation run. The posterior probability in structure runs was constantly increasing with increasing the values of K ranging from 2 to 21 providing little guidance in selecting the optimal number of clusters. The InStruct software [47] showed a similar trend (data not shown) for the same range of K values. These observations were consistent with previously reported analyses of population structure in barley and maize breeding populations using multi-locus SNP data [56, 57]. Therefore, the selection of the optimal value of K in this study was based on the analysis of relationship between Pr (X|K), value of K and the variation of Pr (X|K) among multiple independent runs of Gibb's sampler.

The probability of data for K from 2 to 5 for the A-genome SNP dataset was consistent among multiple independent runs of Structure (Figure 2 and Additional File 4.ppt). For the B-genome SNP dataset, we obtained consistent Pr (X|K) for K values varying from 2 to 4. For values of K above 6 for the A-genome dataset and above 5 for the B-genome dataset the simulation runs could not converge to a single mode (Additional File 4.ppt). The ambiguity of clustering solutions was also accompanied by smaller increase in the mean Pr (X|K). The cluster analysis of both the A- and B-genome datasets showed that the rate of change of Pr (X|K) with increase in K reached more or less stationary value (~200) for values of K = 4 or higher (Additional File 4.ppt). The maximum likelihoods of clustering obtained for correlated and uncorrelated allele frequency models suggested different values of K for the D-genome dataset. The likelihood of the correlated allele frequency model for the D-genome dataset reached its maximum at K = 9. However, the rate of likelihood gain decreased for K values above 7. The likelihood of independent allele frequency model showed that the improvement of the likelihood of clustering dropped dramatically for K above 5.

Figure 2
figure 2

Population structure of spring and winter wheat lines. Clustering was performed using A-, B-, D-genome and combined SNP sets. The genotype of each line on the figure is represented by a colored line where each color reflects the membership of a cultivar in one of the K clusters. The proportion of the colored segment indicates the proportion of the genome drawn from the K clusters.

The genome-wide set of 219 SNPs was first used for assigning each of the 17 pre-defined populations to separate clusters. It is expected that each population should have maximum membership in only one cluster if the allele frequencies among populations are significantly different. However, the clustering analysis demonstrated that in several cases more than one population had membership in the same cluster (Figure 3A). There were also at least five clusters for which none of the 17 pre-defined populations showed a maximum membership coefficient. The maximum values of population-specific membership coefficients Q suggested that only NY and OK winter wheat populations and SD, CA, CM, MN, and MT spring wheat populations derived the majority of their alleles from a single cluster. These results indicated that 17 clusters exceeded the actual number of genetically distinct populations in our sample. A number of cultivars from the CA spring wheat population share ancestry with the lines from the CIMMYT population and nearly all SD and MN spring cultivars were assigned to the same cluster. The winter wheat populations showed lower levels of genetic differentiation with the majority of cultivars in SD, NE, CO, and MT populations having membership coefficient above 0.5 in the same cluster. A reduction in the number of K clusters to ensure that each predefined population has a maximum membership in only one cluster led to the conclusion that optimal clustering can be achieved at K = 9 (Figure 2 and Additional File 6.xls). This value of K is also supported by the calculation of the likelihood of data (Figure 3B).

Figure 3
figure 3

Clustering of 17 pre-defined wheat populations. A) Proportion of membership (Q) of 17 pre-defined populations in 17 clusters inferred using the Structure program. Maximum membership coefficients in one of the 17 clusters for each pre-defined population are indicated by thick lines; B) The log probability of data as a function of K for genome-wide 219 SNP dataset. Means (black bars) and 95% confidence intervals (grey bars) of log probability of data Ln Pr (X|K) for each value of K were calculated from 10 independent runs of structure with 100,000 burn-in steps and 106 simulation steps.

The shared ancestry can result from the usage of related cultivars in different breeding programs, which sometimes can be reflected in the pedigree of lines included in our study (Additional File 1.xls). For example, the cultivar Express was used in the spring wheat breeding programs of CA and WestBred, and cultivar Milan was used in the breeding programs of CIMMYT and CA. However, pedigree data clearly shows that only small fraction of cultivars share common parents. Apparently, pedigree has limited power for inferring the level of genetic relatedness among cultivars in breeding programs because: 1) common lines can be used only at the very early stages of cultivar development and derived lines may carry only small fraction of parental genotype; 2) other parents used in a breeding program can have larger contribution to the genotype of cultivar; 3) pedigree has an unknown level of error; 4) some pedigrees may contain incomplete information.

Principal component analysis (PCA) of population structure revealed that the first two principal components can separate the wheat populations into 3 clusters (Figure 4). The eigenvector 1 separated spring and winter wheat populations; the eigenvector 2 separated spring wheat populations into two clusters, with one cluster predominantly containing lines from MN and SD, while spring lines from CA, CM, ID, WA and WB populations were grouped more closely together. The spring wheat lines from MT, however, showed higher levels of admixture and thus could not be placed within a single cluster (Figure 4).

Figure 4
figure 4

Principal component analysis of genotyping data. A) Clustering of winter and spring wheat lines; B) Clustering of pre-defined breeding populations; sCA, sCM, sID, sWA, sWB, sMN, sMT and sSD - spring wheat populations; wMT, wNE, wSD, wKS, wNY, sCO, wOK, wTX and wWB - winter wheat populations.

Genetic differentiation of populations

Analysis of molecular variance components based on the 849 SNPs showed that wheat lines within breeding populations are more highly genetically differentiated than wheat lines among different populations, which is reflected by a higher proportion of variance within breeding populations than among breeding populations and growth habit groups (Table 3). A higher proportion of among-subpopulation genetic variation was found in spring wheat (17.2%) compared to that in winter wheat (10.6%), indicating that in our panel spring wheat populations are more genetically differentiated than winter wheat populations. Only 9.7% of genetic differentiation among lines was explained by grouping all varieties into spring and winter populations.

Table 3 Analysis of molecular variance components.

The FST estimates for individual loci between spring and winter wheat populations inferred a high level of heterogeneity across chromosomes (Table 4) with the majority of SNP loci having low FST values (Additional File 7.tif). The substantial variation in FST estimates for single SNP loci is supported by the standard deviations of chromosome- and genome-specific FST estimates exceeding the values of means (Table 4). Similar observation for single-locus FST estimates was previously noted in human populations [58]. We found that the FST estimates for individual genomes were consistent with the results of population structure analysis. A lower level of genetic differentiation between the spring and winter wheat populations was observed in the D-genome (mean FST = 0.07) relative to the A- (mean FST = 0.1) and B-genomes (mean FST = 0.11). Up to 50% reduction in variation of single-locus FST estimates can be achieved by calculating the group means of adjacent SNP markers (Table 4). As shown in the Additional File 7.tif, the mean FST values for windows of 5 SNP loci exhibited a lower proportion of extreme values and a narrower distribution. However, the fact that significant variation was still observed for window-based FST estimates suggests that similar FST values are clustered according to their genomic location.

Table 4 Means and standard deviations of single-locus FST estimates obtained for spring and winter wheat populations.

Means of FST estimates in windows of 5 loci were used to identify the regions of the wheat genome which are genetically differentiated between spring and winter wheat populations. Although this approach cannot be considered a formal test for selection, it can be used as a preliminary test to identify genomic regions harboring genes controlling plant growth habit. Random permutation of genome-wide single-locus FST values was used to estimate 95% confidence interval (CI) for window-based FST values. The confidence interval was used as a threshold to identify regions showing FST values higher than the genome-wide mean. Seven regions with FST values higher than 95% CI were detected on chromosomes 2A, 2B, 5A, 6B and 7B (Table 5). Comparison of these regions with previous genetic studies of flowering in wheat showed that three out of the seven regions overlap with previously mapped flowering time QTL or cloned flowering genes [5961].

Table 5 Genetic map intervals that show elevated levels of FST and overlap with flowering time QTL.

Patterns of linkage disequilibrium across the wheat genome

A total of 394 genetically mapped SNPs were used for estimating the extent of LD in wheat populations. Only SNP loci having MAF ≥ 0.05 in a particular population were used for analysis. Loci mapped to the regions of the wheat genome known to be subjected to structural alterations during evolution were excluded from this analysis [62]. In the spring and winter wheat populations, the average extent of significant intra-chromosomal LD (FDR ≤ 0.01) was 20.8 cM (median 11.5 cM) and 19.2 cM (median 10.7 cM), respectively. Significant LD was observed for 68% (253/370) of SNP loci in spring wheat and for 71% (247/348) of SNP loci in the winter wheat. Significant LD was detected for 8.9% and 5.7% of unlinked SNP loci (located on different chromosomes) in the spring and winter wheat populations, respectively. In the combined population of 478 lines, the mean intra-chromosomal LD extended over 19.5 cM (median 10.7 cM) and significant associations were discovered in 8.1% of unlinked loci.

The highest extent of significant LD in the spring, winter and combined populations was observed in the wheat D-genome (mean 20.8 - 27.1 cM), followed by lower LD in the A- (mean 16.7 - 19.8 cM) and B-genomes (mean 20.4 - 21.5 cM) (Additional File 8.xls). The estimates of median LD for significant associations were similar among the wheat genomes and populations, suggesting that the differences among estimates of mean LD were driven by population- and genome-specific differences in the long-range LD, which in most cases exceed 10 cM.

The estimates of r2 for all pairs of linked 394 SNP loci were used to assess the rate of LD decay with genetic distance. The statistically significant threshold for r2 in the spring and winter wheat populations was 0.08 and 0.02, respectively. In the A-genome, LD declined to 50% of its initial value at about 5 cM in winter wheat compared to 6.3 cM in spring wheat (Figure 5A). In the D-genome, LD decayed faster in winter wheat declining to 50% of its initial value at about 6 cM whereas in spring wheat a similar level of LD was reached at 7 cM (Figure 5A). Both spring and winter wheat populations showed an identical rate of LD change in the B-genome decaying to 50% of its initial value over 7 cM (Figure 5A).

Figure 5
figure 5

LD decay estimated using the wheat A-, B-, and D-genome and combined SNP sets. A) Decay of r2 as a function of genetic distance between SNP markers estimated for A-, B-, and D-genomes of the spring and winter wheat populations. B) Decay of r2 as a function of genetic distance between SNP markers estimated for the spring wheat populations from different geographic location in US and Mexico. B) Decay of r2 as a function of genetic distance between SNP markers estimated for the winter wheat populations from different geographic location in US. Only populations with more than 20 wheat lines were included in these analyses.

In order to investigate population-specific recombination processes we studied the rate of LD decay within the populations of different origin (Figure 5B and 5C). Due to the limited number of lines, WB, KS and NY winter wheat and WA spring wheat populations were excluded from this analysis. As expected, compared to LD estimates in the combined population dataset (Figure 5A), the estimates of LD within populations were higher (Figure 5B and 5C). The level of initial LD (the highest value of LD on Figure 5) in spring wheat populations, except for the CIMMYT population, varied from 0.43 to 0.49 (Figure 5B). The CIMMYT population had the lowest level of initial LD (r2 = 0.35) decaying to 50% of its value at about 10.5 cM. In the remaining spring wheat populations LD decayed to half of its initial value within 6-9 cM range (Figure 5B). The levels of initial LD in the winter wheat populations were similar to those in the spring wheat population with r2 ranging from 0.42 to 0.49 (Figure 5C) and decaying to 50% of these initial values within 7-9 cM.

The extent of LD varied greatly across the wheat genome even among closely linked SNP loci. Using the method described by Gabriel et al. [54], we identified only two genomic regions harboring 5 or more SNP loci that show little evidence of historical recombination events. These LD blocks shared by winter and spring wheat populations were located on chromosomes 2A and 3B (Figure 6). Pairs of loci with elevated LD tended to be localized near the centromere in regions bearing low recombination rates. The LD data for each chromosome in the population of 478 wheat cultivars are summarized in the Additional File 9.xls.

Figure 6
figure 6

Pair-wise LD for the wheat chromosomes 2A and 3B. Colored rectangles represent the squared correlation r2 between a pair of SNPs. The values of r2 are color-coded according to the color-key provided below the LD maps.

Discussion

Genetic diversity and population structure

Our study provides an overview of genetic variation in US and CIMMYT spring and winter wheat cultivars using genome-wide distributed SNP markers. Here we confirmed the utility of the wheat OPA for genotyping large populations of hexaploid wheat lines [32]. Most of the SNPs that were polymorphic within the complete set of 478 cultivars were also polymorphic in all subpopulations. Of the 849 polymorphic SNPs, 89% were polymorphic in both spring and winter wheat populations and from 70% to 85% were polymorphic across populations. Such a widespread distribution of polymorphic loci among populations suggests that the SNP discovery performed in a set of genetically diverse wheat landraces and wild emmer wheat [19, 36] was successful in recovering alleles represented in both growth habit groups. However, the distribution of MAF showed a higher proportion of medium frequency alleles in the spring wheat than in the winter wheat population. Currently, it is not clear whether this observed bias is caused by historical events, such as demography or selection, or if it is the result of ascertainment schemes applied during SNP discovery process. If the latter is true, the bias is likely small given the high proportion of polymorphic SNPs shared between spring and winter populations.

The proportion of genetic differentiation explained by growth habit (9.7%) was only slightly lower than the proportion of variation among subpopulations within the growth habit groups (12.9%). Historical gene flow between the spring and winter wheat groups during crop improvement and breeding can potentially be responsible for the low level of genetic differentiation between these populations. The proportion of variance between growth habit and among populations was significantly lower than the within-population genetic variance, indicating that each of the breeding programs included in our study employs genetically diverse lines. These results also indicate that the polymorphic SNPs included in the wheat OPA are represented in most populations and, therefore, will be useful for genotyping diverse collections of wheat cultivars.

In spite of the high proportion of shared SNPs among populations as well as small among-population genetic variance components, the model-based clustering approach was able to successfully assign cultivars to clusters. The clustering analysis performed using the whole genome SNP set produced more genetically distinct clusters than clustering obtained with smaller sets of SNPs from the A-, B- or D- genomes. Although cultivars can be optimally clustered at the same value of K using the A- and D- genome SNP sets, the proportion of genetic ancestry of cultivars in these clusters was variable for different SNP sets, implying that the three wheat genomes have different degrees of genetic differentiation among breeding programs. This outcome may be a consequence of inadequate representation of SNP alleles within a particular genome in different populations, or alternatively, can reflect the different impact that demography, population and breeding history had on genomes of wheat lines. Strong selection for adaptation to diverse environmental conditions, together with different founders and introgression histories can modulate the differentiation of allelic frequencies among breeding populations and genomes and result in the slightly different clustering patterns obtained here using the three genome-specific SNP sets. A similar trend was documented for US, Australian and UK varieties using DArT markers [63], which showed that wheat genomes are differentiated in allelic frequency among national breeding programs [64].

Even when the full SNP data set was used, wheat cultivars in the 17 wheat populations rarely shared the same membership coefficient in the inferred clusters reflecting the complexity of the breeding histories of lines included in this analysis. Most wheat lines showed evidence of admixture with the portions of their genomes assigned to 2 - 4 different inferred clusters, which is an expected result of the frequent crosses used in wheat breeding programs between adapted germplasm and the donors of different traits. When the whole population was forced to divide into two groups (K = 2), the clusters aligned mainly by growth habit, with most spring and winter wheat cultivars being assigned to separate clusters. The same grouping by growth habit was also apparent when the analyses were performed separately for the A- and B-genome SNPs. However, the informativeness of D-genome SNPs for the separation of spring and winter varieties was low, likely a consequence of low allelic frequency differentiation between these two wheat groups as evidenced by the low inter-population FST obtained for the D-genome.

Clustering using the combined SNP set showed that the inferred number of clusters in our population is smaller than the number of pre-defined breeding populations, largely due to the fact that breeding programs from the same region tend to use cultivars of common ancestry. The spring wheat population included cultivars from breeding programs targeting more geographically separated areas that were also more genetically differentiated that the winter wheat population. A high level of genetic differentiation was observed between the populations originating from two major geographical locations one including northern states SD, MN and MT and the other including Mexico and western states WA, ID and CA. In contrary, the winter wheat populations largely originating from the central states showed higher levels of admixture and a lower extent of genetic differentiation.

Genetic differentiation of spring and winter wheat

The characterization of FST across chromosomes provided additional insights into the structure of genetic variation between the spring and winter wheat populations. Assuming the same evolutionary processes affect neutral loci, identifying genomic regions showing elevated FST between spring and winter wheat populations should make it possible to localize the targets of selection controlling growth habit phenotype. However, the substantial heterogeneity of FST estimates for SNP loci across the wheat genome make it impossible to use single-locus FST values for detecting past selection events. This problem was circumvented by calculating FST for a group of sequential SNP loci which was shown to be an efficient strategy to reduce variation in FST estimates relative to estimates based on individual loci [58]. The highest degree of genetic differentiation was identified for the loci mapped to the wheat chromosome 5A, which probably results from the presence of Vrn1 gene locus, the major gene involved in regulation of flowering time in wheat [60]. This locus is responsible for most of the natural variation in the growth habit in hexaploid wheat [65, 66]. Additional regions showing unusually high level of genetic differentiation between spring and winter wheat lines were detected on the chromosomes 2A, 2B, 6B and 7B (Table 5). Three out of seven regions with elevated FST were co-localized with previously mapped genes known to be involved in flowering time regulation. Some wheat chromosome 6B substitution lines are known to affect flowering time in the absence of vernalization [67], but since the responsible gene has not been mapped, it is not possible to determine if this gene locus overlaps with high FST region identified on the chromosome 6B in this study. Although the distribution of empirical FST estimates cannot serve as a formal test for selection, this finding suggests that high FST genomic regions can harbor genes subject to diversifying selection providing good targets for further studies.

The genetic differentiation of some of the genomic regions can also be due to structural rearrangements abundant in one of the populations. For example, chromosomal inversions are known to be a major barrier for gene flow between populations due to limited recombination near the affected genomic regions and also one of the mechanisms facilitating reproductive isolation and species formation [68]. Previously it was demonstrated that pericentomeric inversion polymorphisms are widespread in wheat [69]. We found that one of these inversions overlaps with one of the regions with elevated FST detected on the chromosome 6B. This structural rearrangement can potentially impact the frequency of allele exchange between the spring and winter wheat populations and contribute to the genetic differentiation of this genomic region.

Linkage disequilibrium

Using genome-wide SNP data we demonstrated the extensive amount of LD in the populations of wheat cultivars. The variation in the patterns of LD among the populations and wheat genomes reflects the complexity of evolutionary and breeding history of wheat [70]. The extent of LD and LD decay estimated using SNP loci combined from all three wheat genomes was similar in both spring and winter wheat populations. In the analyses using the individual genomes, the differences between spring and winter lines in LD decay to 50% were also very small varying from no difference in the B-genome (7 cM both) to 1.3 cM in the A-genome (6.3 cM spring and 5 cM winter).

Analyses of LD decay by breeding population showed similar profiles among populations except for the CIMMYT population, which had the lowest LD among completely linked loci and the slowest rate of LD decay. A possible explanation for this observation is the intensive usage of synthetic wheat lines in the CIMMYT program. Synthetic wheats are generated by hybridization of diverse tetraploid (A- and B-genomes) and Ae. tauschii (D-genome) accessions followed by chromosome duplication using colchicine. The synthetic wheats and their derivatives have greatly increased genetic diversity in hexaploid wheat, particularly in the D-genome [7173]. It is well known that the introduction of new haplotypes from divergent populations can increase the extent of LD [74].

Depending on the genomic location of genes controlling important adaptive traits, these broad crosses can have a differential impact on LD in different genomes. For example, because the Vrn-A1 gene has a stronger effect than the Vrn-B1 gene, it has higher number of widely distributed haplotypes [66] and is thus more likely to have a stronger effect on LD. Therefore, the divergence in the extent of LD between wheat populations is probably related to unique breeding histories and selection pressures applied to genes located in the different genomes during the process of cultivar development.

A genetic bottleneck may also increase the level of LD [2, 74]. The last polyploidization event resulting in the origin of hexaploid bread wheat approximately 8,000-10,000 years ago had a dramatic impact on the level of genetic diversity in the D-genome [19, 75] suggestive of strong population bottleneck. We hypothesize that the longer extent of significant LD in the D-genome compared to that in the A- and B-genomes in both spring and winter wheat populations can mostly be explained by this polyploidization event [19]. However, the difference in LD between the D-genome, and the A- and B-genomes in spring wheat was not as high as in winter wheat. This result can probably be explained by 1) the larger number of breeding cycles involved in the development of spring wheat cultivars than in the development of winter wheat cultivars, and/or by 2) the inclusion of synthetic-derived wheat cultivars in the CIMMYT spring population.

Rates of LD decay varied among populations, but as expected, individual populations showed higher overall levels of LD than the combined datasets. These higher LD levels were also reflected in elevated levels of long-range LD extending above 10 cM. Interestingly, across all populations, LD decayed to 50% of its initial value within relatively narrow genetic intervals ranging from 6 to 9 cM. This rate of LD decay is probably associated with the high level of genetic diversity used in the individual breeding program. The cultivars in all of these programs captured comparable number of recombination events resulting in fast erosion of LD. However, each population showed variation in the extent of long range LD which was highest in the SD winter wheat population and WB spring wheat population. As pointed out earlier, these differences are probably the consequence of breeding history and selection specific to each breeding program.

The comparison of the LD levels obtained in our study with results obtained in other studies dealing with wheat and other inbreeding crops was complicated by the differences in the type of markers used for genotyping, and by sample size variation in the different studies. Both factors can impact LD estimates. Previously reported LD estimates in wheat were obtained using more polymorphic SSR markers [22, 24, 25]. In a sample of 43 U.S. spring and winter wheat cultivars it was shown that 70 out of 123 SSR loci (57%) with significant LD were linked at <10 cM [22]. In our study 86% of SNP loci (211/246) showing significant LD in the combined population of spring and winter wheat were located at less than 10 cM. The larger proportion of alleles with significant LD at less than 10 cM detected in our study is most likely due to sample size differences across the two studies (478 vs. 43 lines) used to estimate significant LD. The extent of significant LD in our population was more than 4 times higher than the SNP-based estimates obtained for a population of 91 European spring and winter cultivated barley [3]. These results suggest that the genetic diversity and number of recombination events in European barley germplasm are significantly higher than in the sample of U.S. and CIMMYT wheat cultivars. Therefore, association mapping studies in wheat would require a smaller number of markers per unit of genetic distance than needed in cultivated barley.

Variation in the extent of LD along the chromosome affect the number of tagSNPs (subset of SNPs that capture a large fraction of the allelic variation of all SNP loci [76]) required in each genomic region to ensure that causal mutations are in LD with neighboring SNPs. The interaction of many factors affecting the rate of LD decay in the different parts of the genome complicates the determination of the number of tagSNPs required to gain sufficient power for genome-wide association mapping. Estimates of this number for autogamous plant species varied from 9,600 to 29,400 SNPs for soybean cultivars [6] to 250,000 for the more diverse Arabidopsis natural populations [4]. LD values of 0.8 or higher have been recommended as an acceptable threshold for tagSNP selection [77]. In our study, for loci located from 0.0 to 0.2 cM apart, the median LD was approximately 0.8. If markers are evenly distributed at 0.2 cM intervals, the causative mutation would be found at about 0.1 cM from one of the flanking markers and have an approximate LD of 0.8. In a 3,500 cM hexaploid wheat map, placing markers at 0.2 cM will require at least 17,500 markers. This number would vary depending on if more liberal or conservative LD thresholds were selected.

The evolutionary history of an allele also has a strong impact on the probability of detecting marker-trait associations. Alleles of loci that are involved in local adaptation and subjected to recent selection can be more readily detected using an even more sparsely distributed set of markers. For example, marker-trait associations of alleles involved in the regulation of flowering time in Arabidopsis [78] and cultivated barley [3] were detected using a relatively low number of SNP markers. Genome-wide re-sequencing efforts similar to the ones performed for Arabidopsis [4], rice [1] and maize [79] will be required to provide comprehensive information for tagSNP selection in wheat. These efforts will also need to be complemented by assessment of the portability of selected tagSNPs to other populations. Otherwise, inadequate genome coverage may result in failure to identify critical associations [78, 80]. The possibility of performing GWAM in the large polyploid wheat genome will be tested in future using a larger panel of up to 9,000 genome-wide distributed SNP markers currently under development.

Conclusions

Our study demonstrated a high level of genetic diversity and relatively fast decay of LD within each wheat breeding program. Extensive exchange of genetic material between breeding programs resulted in a low level of genetic differentiation between populations of spring and winter wheat. The regions of the wheat genome harboring flowering time QTL demonstrated the highest levels of genetic differentiation between the spring and winter wheat populations. Breeding, selection and founder effect had a different impact on the wheat genomes in distinct populations, highlighting the significance of allopolyploidy for the development of cultivars adapted to a broad range of environmental conditions. Assessment of the extent of LD and population structure in the assembled panel provided valuable information for the design of GWAM experiments in wheat.