Analysis of gene-derived SNP marker polymorphism in US wheat (Triticum aestivum L.) cultivars
- Chao, S., Zhang, W., Akhunov, E. et al. Mol Breeding (2009) 23: 23. doi:10.1007/s11032-008-9210-6
In this study, we developed 359 detection primers for single nucleotide polymorphisms (SNPs) previously discovered within intron sequences of wheat genes and used them to evaluate SNP polymorphism in common wheat (Triticum aestivum L.). These SNPs showed an average polymorphism information content (PIC) of 0.18 among 20 US elite wheat cultivars, representing seven market classes. This value increased to 0.23 when SNPs were pre-selected for polymorphisms among a diverse set of 13 hexaploid wheat accessions (excluding synthetic wheats) used in the wheat SNP discovery project (http://wheat.pw.usda.gov/SNP). PIC values for SNP markers in the D genome were approximately half of those for the A and B genomes. D genome SNPs also showed a larger PIC reduction relative to the other genomes (P < 0.05) when US cultivars were compared with the more diverse set of 13 wheat accessions. Within those accessions, D genome SNPs show a higher proportion of alleles with low minor allele frequencies (<0.125) than found in the other two genomes. These data suggest that the reduction of PIC values in the D genome was caused by differential loss of low frequency alleles during the population size bottleneck that accompanied the development of modern commercial cultivars. Additional SNP discovery efforts targeted to the D genome in elite wheat germplasm will likely be required to offset the lower diversity of this genome. With increasing SNP discovery projects and the development of high-throughput SNP assay technologies, it is anticipated that SNP markers will play an increasingly important role in wheat genetics and breeding applications.
Abbreviations
EST: Expressed sequence tag
HRS: Hard red spring
HWS: Hard white spring
HRW: Hard red winter
HWW: Hard white winter
PIC: Polymorphism information content
SSR: Simple sequence repeat
SNP: Single nucleotide polymorphism
SRW: Soft red winter
SWS: Soft white spring
SWW: Soft white winter
The discovery of a large number of single nucleotide polymorphisms (SNPs) in humans has revealed the power of this technology to generate high-resolution genetic maps (Brookes 1999). Though SNPs are generally biallelic and thus often less informative than multi-allelic simple sequence repeats (SSRs), their sheer abundance makes possible the development of high density SNP genetic maps, providing the foundation for subsequent population-based genetic analysis (Rafalski 2002). The utilization of multi-SNP haplotypes can offset the relatively low information content of single SNP loci (Brumfield et al. 2003). Another advantage of the SNP markers is that they do not depend on estimates of fragment lengths, a requirement of SSR markers that has limited the standardization of such data across different laboratories and equipment.
The large amount of plant sequencing data available in public databases represents a rich resource for SNP discovery using bioinformatics approaches. In rice, for example, thousands of candidate polymorphisms were identified by comparing the draft genome sequences from indica and japonica subspecies (Feltus et al. 2004). For other plant species, extensive expressed sequence tag (EST) collections provide an alternative source for the in silico detection of SNPs (referred to hereafter as eSNPs). Indeed, the alignment of ESTs from different cultivars of maize (Zea mays L.) has been useful to develop eSNPs for the functional part of the maize genome (Batley et al. 2003). In barley, too, over 3,000 EST-derived eSNPs have been identified, genetically mapped, and subsequently used to assess genetic diversity and the extent of genome-wide linkage disequilibrium (Rostoks et al. 2006; Hayes and Szucs 2006; Tim Close, personal communication). Genome-wide maps comprised of large numbers of SNP markers have also been reported in Arabidopsis thaliana (Cho et al. 1999), rice (Oryza sativa L.) (Nasu et al. 2002), and soybean (Glycine max L.) (Choi et al. 2007).
SNP densities in plants vary widely, as revealed by several SNP discovery studies. In out-crossing species, some average SNP densities include: 1 SNP/47 bp in non-coding regions of 36 inbred lines of maize (Ching et al. 2002); 1 SNP/73 bp in regions corresponding to 592 unigenes of 12 inbred maize lines (Vroh Bi et al. 2006); and 1 SNP/72 bp in 315 EST-derived loci of a panel of 13 lines of sugar beet (Beta vulgaris L.) (Schneider et al. 2007). The average SNP densities among self-pollinated species, however, tend to be lower: an estimated 1 SNP/270 bp in non-coding and random genomic regions of 25 soybean cultivars (Zhu et al. 2003) and 1 SNP/200 bp in 870 unigene-derived genomic regions of eight diverse accessions of barley (Rostoks et al. 2005). In wheat, one study comparing the sequences of 21 genes across 26 diverse germplasm accessions revealed an average of 1 SNP/330 bp in genic regions (Ravel et al. 2006), while a different study using a smaller and less diverse germplasm sample (12 genotypes) discovered an average of 1 eSNP/540 bp of wheat EST regions (Somers et al. 2003).
Further large-scale eSNP discovery in wheat is limited by both the polyploid nature of the organism and the high sequence similarity found among the three homoeologous wheat genomes (Somers et al. 2003). In an effort to complement the eSNP discovery strategy, a wheat SNP discovery project (http://wheat.pw.usda.gov/SNP) was funded by the National Science Foundation (NSF) to discover SNPs within intronic regions of wheat genes. The strategy used by this group of researchers was to sequence homoeologous gene regions from the three wheat genomes, develop genome-specific primers to amplify intronic regions, and use those primers to screen a diverse collection of 13 Triticum aestivum accessions originating from different parts of the world and eight synthetic wheats. Previously, a similar strategy of developing genome-specific primers based on intron sequences was used successfully to detect SNPs within starch biosynthesis genes in wheat (Blake et al. 2004).
In the current study, we used the genome-specific primers and SNP information generated by the NSF project to develop 359 new SNP-detection primers. Using the template-directed dye-terminator incorporation assay with fluorescence polarization detection (FP-TDI) (Chen et al. 1999), we assessed the level of SNP polymorphism across 20 US adapted wheat genotypes representing seven market classes. The genetic diversity of these US cultivars was then compared to that of the same 13 diverse accessions used in the wheat SNP discovery project as a means of investigating the broad effects of modern breeding selection on wheat genome diversity.
Materials and methods
Table 1 Number and percentage of polymorphic SNP markers detected for each pair of parental lines included in this study
[Table body not recovered. Column headings: No. markers scored; No. (%) polymorphic markers. Surviving row labels: Rio Blanco (HWW); NY18/Clark’s Cream 40-1 (SWW); TAM 105 (HRW); USG 3209 (SRW).]
SNP selection and genome-specific primer validation
The T. aestivum SNPs used in this study were ascertained during the NSF Wheat SNP project from a panel including the 13 common wheat accessions described above and eight synthetic wheats. There was no frequency cut-off applied to define SNPs. The SNP definition criterion and the use of a discovery panel that was more diverse than our target population (US commercial varieties) are expected to limit the impacts of ascertainment bias (Brumfield et al. 2003).
We first developed SNP-detection primers for a set of ESTs that contain at least one polymorphism among the 13 accessions of the diverse germplasm set defined above. A second group of SNP-detection primers was then developed from ESTs that are not polymorphic in the diverse germplasm set but that contain at least one polymorphism among the synthetic and durum wheats included in the NSF Wheat SNP project. These two sets of SNPs were analyzed separately to determine if pre-selection of SNPs polymorphic in the diverse T. aestivum germplasm set indeed enhances the chance of finding polymorphism among US cultivars.
To ensure that the PCR fragments amplified by the genome-specific primers were derived only from the intended chromosomes, nullisomic-tetrasomic (NT) lines of CS (Sears 1954) were used to optimize the specificity of the PCR conditions to our PCR equipment. The optimized targeted regions were then amplified in the 20 wheat cultivars used in this study. All PCR reactions were performed using 50 ng of genomic DNA in 20 μl PCR reaction mix containing 1 unit of Taq polymerase, 1.5 mM MgCl2, 100 μM of each of the four dNTPs, and 5 pmol each of forward and reverse genome-specific primer. The PCR cycling used for most of the primer pairs included an initial denaturation step at 95°C for 3 min, followed by 10 cycles of touch down at 95°C for 20 s, from 63 to 58°C for 20 s (0.5°C decrease per cycle), and 72°C for 80 s. The touch down was followed by 36 cycles of 95°C for 20 s, 58°C for 20 s, and 72°C for 80 s. When these general conditions did not work, they were modified by extending the number of cycles to 40 and by varying the annealing temperatures from 56 to 66°C.
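The general cycling program described above can be written out as a list of per-cycle annealing temperatures. The following is only a sketch: the function and parameter names are hypothetical, and it assumes that the first touch-down cycle anneals at 63°C, so the tenth anneals at 58.5°C before the 36 fixed cycles at 58°C begin.

```python
def touchdown_schedule(start_c=63.0, step_c=0.5, touchdown_cycles=10,
                       final_anneal_c=58.0, final_cycles=36):
    """Return the annealing temperature (deg C) used in each PCR cycle.

    Assumption (not stated explicitly in the text): the first touch-down
    cycle anneals at start_c and every later one is step_c cooler.
    """
    # Touch-down phase: 63.0, 62.5, ..., 58.5 with the default values.
    temps = [start_c - step_c * i for i in range(touchdown_cycles)]
    # Fixed-temperature phase that follows the touch-down.
    temps += [final_anneal_c] * final_cycles
    return temps


schedule = touchdown_schedule()
print(len(schedule), schedule[0], schedule[9], schedule[10])
```

The modified conditions mentioned in the text (40 cycles, annealing between 56 and 66°C) would correspond to different argument values under the same assumption.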
Whenever possible, SNP-detection primers were designed from regions ending one base immediately upstream from the polymorphic site on both DNA strands. The primers were designed with melting temperatures between 55 and 60°C and lengths between 25 and 30 bases. In cases where multiple SNPs were discovered within the same EST, SNP haplotypes among the wheat accessions were compared. If two haplotypes were detected, only one SNP was selected for assay design. However, if more than two haplotypes were observed within the discovery panel, SNP-detection primers were designed for assaying two different SNP sites informative for identifying different haplotypes among the accessions.
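The placement rule for SNP-detection primers (3' end one base upstream of the polymorphic site, on each strand) can be illustrated with a short sketch. The function names and the 0-based `snp_index` convention are assumptions made for illustration; the melting-temperature filter (55 to 60°C) is deliberately left out because the text does not say which Tm formula was used.

```python
# Complement table for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def detection_primers(seq, snp_index, length=25):
    """Return (forward, reverse) primers whose 3' ends abut the SNP.

    seq is the top strand written 5'->3'; snp_index is the 0-based
    position of the polymorphic base (hypothetical convention).
    A primer is None when the flanking sequence is too short.
    """
    if not 25 <= length <= 30:
        raise ValueError("the text specifies primers of 25 to 30 bases")
    # Top strand: the `length` bases immediately upstream of the SNP.
    forward = seq[snp_index - length:snp_index] if snp_index >= length else None
    # Bottom strand: reverse complement of the bases immediately
    # downstream of the SNP, so that its 3' end also abuts the site.
    downstream = seq[snp_index + 1:snp_index + 1 + length]
    reverse = reverse_complement(downstream) if len(downstream) == length else None
    return forward, reverse
```

Candidates returned by such a function would still need to be screened for melting temperature and for specificity before use.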
SNP detection and allele scoring
SNP detection was carried out using a single-base extension assay based on the method of template-directed dye-terminator incorporation assay with fluorescence polarization detection (FP-TDI) (Chen et al. 1999). An aliquot of the amplified genome-specific fragments was combined with primer extension reaction mix and 5 pmol SNP-detection primer from one DNA strand, following the protocols for the AcycloPrime II SNP detection kit provided by Perkin Elmer (Boston, MA). Two fluorescent dye-labeled nucleotides were included, allowing the two allelic variants of a specific SNP to be interrogated in a single assay. The primer extension reactions were carried out using an initial denaturation cycle at 95°C for 2 min, followed by 20 cycles of 95°C for 15 s and 60°C for 30 s. At the end of the assay, the reaction mix was subjected to fluorescence polarization (FP) measurements using a Perkin Elmer Victor V plate reader. The data analysis and allele calls based on clustering FP values were performed using the Excel workbooks from http://www.snpscoring.com.
SNP marker diversity for the US cultivars and the diverse germplasm set was measured using the polymorphism information content (PIC) formula proposed by Weir (1996) and implemented in the PowerMarker software (Liu and Muse 2005). PIC values, which provide an estimate of the probability of finding a polymorphism between two random samples of the germplasm, were also calculated separately for each chromosome and genome. The ratio between the PIC values for the US cultivars and the diverse germplasm set was used to estimate the relative decrease in diversity among the three genomes in this modern group of cultivars. To test the significance of the differences in diversity reduction among genomes, the PIC ratios for individual loci were treated as replications in a non-parametric Kruskal–Wallis rank sum test (implemented using the R statistical package, http://www.r-project.org). SNP frequency distributions for the different genomes were compared using χ² tests.
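As a minimal sketch of the diversity measure, the function below computes PIC for one marker as one minus the sum of squared allele frequencies. This form is an assumption: it matches the description given here (the probability that two random samples differ) and the maximum of 0.5 that the article states for biallelic markers, but it is not taken from the PowerMarker source. The function name and the use of `None` for missing calls are hypothetical.

```python
from collections import Counter


def pic(alleles):
    """PIC of one marker, assumed here to be 1 - sum(p_i ** 2).

    alleles holds one allele call per line; None marks a missing call.
    """
    calls = [a for a in alleles if a is not None]
    if not calls:
        raise ValueError("no scored genotypes for this marker")
    n = len(calls)
    # Sum of squared allele frequencies over all observed alleles.
    return 1.0 - sum((count / n) ** 2 for count in Counter(calls).values())


# Hypothetical marker scored on 20 lines: 18 carry "A", 2 carry "G".
example = pic(["A"] * 18 + ["G"] * 2)
```

Averaging such per-marker values over all markers on a chromosome or in a genome would give the grouped values described in the text.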
The SNP PIC values were also compared with PIC values obtained from a previous study including 242 wheat genomic SSR markers and the same set of US cultivars (Chao et al. 2007). Distance matrices for the 20 cultivars were calculated for SSR and SNP markers separately using Rogers’ distance, and the correlation between matrices was determined using the Mantel test (Mantel 1967). These calculations were performed using the PowerMarker software.
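The distance and matrix-correlation steps can be sketched as follows. This is an illustration only, not the PowerMarker implementation: it assumes fully homozygous lines with one allele call per locus (in which case Rogers' distance reduces to the fraction of jointly scored loci that differ), and it uses a simple one-sided permutation version of the Mantel test. All function names are hypothetical.

```python
import math
import random


def rogers_distance(line_a, line_b):
    """Rogers' distance between two inbred lines (one allele call per locus).

    With homozygous calls each locus contributes 0 or 1, so the distance
    is the fraction of jointly scored loci at which the lines differ.
    """
    pairs = [(a, b) for a, b in zip(line_a, line_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no locus scored in both lines")
    return sum(a != b for a, b in pairs) / len(pairs)


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mantel(m1, m2, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(m1)
    x = upper_triangle(m1)
    r_obs = pearson(x, upper_triangle(m2))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        # Permute rows and columns of the second matrix together.
        rng.shuffle(order)
        permuted = [[m2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)
```

In this setting `m1` and `m2` would be the 20 × 20 distance matrices built from the SSR and SNP calls, respectively.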
In this study, we assayed a total of 364 ESTs, including 145 for which we designed SNP primers for two different polymorphic sites. For each EST, the chromosome bin location, SNP position and DNA strand, and SNP-detection primer sequence are available in supplementary Table S1. Only the primer from the strand resulting in unambiguous allele discrimination is reported.
Primers from 350 ESTs yielded unambiguous genotype calls in the FP assays, with at least one of the two primers designed for each polymorphic site. The SNP assays for the remaining 14 ESTs failed to give clear genotype calls, which is more likely due to a technical failure than to a problem with the SNP primer specificity. Nevertheless, the conversion rate from discovered SNPs to working assays depends on many factors, such as the levels of sequencing errors, sequence compositions near the targeted SNPs, and the genotyping systems used. Based on our results, we estimate that the single-base extension method can yield an overall assay success rate of 96% in wheat cultivars when primers for both strands are tested.
SNP marker polymorphism in US wheat cultivars
Of the 145 ESTs for which we designed SNP primers for two polymorphic sites, 31 were found to be polymorphic at both sites for the 20 genotypes used in the study. These two polymorphisms defined two haplotypes for 22 of the ESTs and more than two haplotypes for the other nine. Out of the remaining 114 ESTs, 57 were found to be polymorphic for only one site, 51 were monomorphic for both sites, and six failed. This result indicates that assaying two different SNPs per EST (each for both strands) increases the chance of finding polymorphic SNP markers by at least 16% (57/350), suggesting that an optimization step using primers from both strands is worthwhile. The results further showed that among the ESTs with more than one SNP assayed, only a small portion detected more than two haplotypes (9/145) among wheat cultivars, thereby providing evidence that the extent of linkage disequilibrium (LD) is likely extensive within the population of wheat cultivars used in this study. The extent of LD is expected to decrease within populations of more diverse germplasm such as those used for SNP discovery. Altogether, 359 SNPs (341 ESTs with 1 SNP and nine with 2 SNPs) yielded unambiguous genotype calls. Call quality was further assessed by determining the rate of high quality calls among the 20 genotypes assayed in the study. Over 70% of the 359 SNPs yielded high quality calls for all 20 genotypes. Among the other 30%, 29% gave high quality calls for 17–19 genotypes, whereas only 1% of the SNPs assayed exhibited high quality calls for 13 to 16 genotypes. None of the selected SNPs gave less than 13 high quality calls among the 20 genotypes examined.
Of the 359 SNPs selected, 253 revealed at least one polymorphism among the diverse germplasm set, and the remaining 106 were polymorphic only in synthetic or tetraploid wheat accessions. These last 106 SNPs were selected for a balanced representation of the three homoeologous genomes (31 from the A genome, 36 from the B genome, and 39 from the D genome). The sets of 253 and 106 SNPs were evaluated separately to quantify the effect of pre-selecting polymorphic SNPs in the diverse germplasm set on the level of polymorphism detected among US cultivars.
Results from the full SNP dataset showed that 212 markers (59%) detected at least one difference among the 20 US wheat cultivars. This percentage increased to 70% when only those 253 SNPs pre-selected for polymorphisms were considered and dropped to 33% when considering only the remaining 106 SNPs. The full SNP dataset was also used to calculate the number of polymorphic markers and the level of polymorphism in the parental lines of the 10 mapping populations in order to estimate the number of polymorphisms that can be expected in mapping populations based on crosses between adapted US cultivars (Table 1). On average, 16.4% of the 359 SNPs tested were polymorphic between the two parental lines of a mapping population. The percentage was slightly higher (20.4%) for the subset of 253 pre-selected SNPs and much lower (6.5%) for the remaining 106 SNPs. Considering all the SNPs, the highest level of polymorphism (23.4%) was found between two parental lines belonging to different growth habits (spring line Grandin/ND614 and winter line NY18/Clark’s Cream 40-1) (Table 1).
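The two kinds of percentages reported here (polymorphic across the whole panel, and polymorphic between a pair of parental lines) can be computed with the same small routine. The sketch below is hypothetical throughout: the data layout, the function names, and the rule that a marker counts as scored only when at least two of the selected lines have a call are all assumptions made for illustration.

```python
def is_polymorphic(calls):
    """True if at least two different alleles occur among the scored calls."""
    return len({c for c in calls if c is not None}) > 1


def percent_polymorphic(genotypes, lines=None):
    """Percentage of scored markers that are polymorphic among `lines`.

    genotypes maps marker -> {line name: allele or None}. When lines is
    None, every line recorded for a marker is used. Markers with fewer
    than two scored calls among the selected lines are skipped.
    """
    scored = polymorphic = 0
    for calls_by_line in genotypes.values():
        names = lines if lines is not None else list(calls_by_line)
        present = [calls_by_line.get(name) for name in names]
        present = [c for c in present if c is not None]
        if len(present) < 2:
            continue
        scored += 1
        polymorphic += is_polymorphic(present)
    if scored == 0:
        raise ValueError("no marker was scored in the selected lines")
    return 100.0 * polymorphic / scored
```

Called with all 20 cultivars this would give the panel-wide figure; called with the two parents of one cross it would give a per-population figure of the kind listed in Table 1.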
Genetic diversity of SNP markers in US wheat cultivars
Table 2 Allele number and PIC values calculated for 359 SNP and 242 SSR markers in a set of 20 US wheat cultivars [table body not recovered]
Table 3 Comparison of Polymorphism Information Content (PIC) values among 20 US wheat cultivars and a set of 13 diverse common wheat accessions [table body not recovered]
Relative reductions in SNP marker diversity per genome
To investigate the relative changes in SNP diversity among the different genomes, we compared diversity values obtained for the 20 US wheat varieties with those obtained for the diverse germplasm set. The subset of 253 SNP markers was used for this analysis. Since this subset includes only polymorphic markers, the PIC values cannot be used to estimate the natural diversity of the diverse germplasm set or the decrease in diversity among the US cultivars. However, the comparison of the reduction in genetic diversity among genomes is still valid because of its relative nature.
The ratios of PIC values from the panel of US cultivars and the diverse germplasm set for SNPs grouped by genome (PIC_cultivar/PIC_diverse) revealed that a higher amount of SNP diversity was retained in both the A genome (77%) and the B genome (62%) than in the D genome (50%). Using the PIC_cultivar/PIC_diverse ratios of individual loci as replications in a non-parametric Kruskal–Wallis ANOVA, we confirmed the absence of a significant difference between the A and B genome values (P > 0.05). However, the PIC_cultivar/PIC_diverse ratios for the D genome SNPs were significantly lower than those in the other two genomes (P < 0.05).
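For readers who want to see the mechanics of the rank-based comparison, the sketch below computes the Kruskal–Wallis H statistic (with the usual tie correction) for three groups of per-locus PIC ratios. It is not the R routine used in the study. The p-value relies on the chi-square approximation, which for three groups (2 degrees of freedom) has the closed form exp(-H/2). The function name is hypothetical.

```python
import math


def kruskal_wallis_three_groups(a, b, c):
    """Return (H, p) for three samples, e.g. per-locus PIC ratios by genome."""
    groups = [list(a), list(b), list(c)]
    if any(len(g) == 0 for g in groups):
        raise ValueError("each group needs at least one value")
    # Pool all values, remembering which group each one came from.
    tagged = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(tagged)
    rank_sums = [0.0, 0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and tagged[j][0] == tagged[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[tagged[k][1]] += average_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)) - 3.0 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    # Chi-square survival function with 2 degrees of freedom.
    return h, math.exp(-h / 2.0)
```

Because the 253 SNPs used for this analysis are all polymorphic in the diverse set, the denominator of each per-locus ratio is non-zero.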
To test if these differences were significant, we performed a χ² test comparing the number of SNPs per quartile among the different genomes. The A and B genomes showed no significant differences (P > 0.05), whereas the D genome differs significantly from the other two genomes (P < 0.05) (Fig. 1). These results confirm that, in the diverse germplasm set, SNPs in the D genome have a different frequency distribution than those in the A and B genomes.
Comparison between SSR and SNP markers in wheat
In a previous genetic diversity analysis we characterized 43 wheat genotypes with a set of 242 SSR markers distributed across all 21 chromosomes (Chao et al. 2007). From this data set we selected the 20 US cultivars included in this study and recalculated the diversity values. The average number of alleles per SSR marker was 5.5 and the mean PIC value was 0.63 (Table 2), more than 3-fold larger than the value obtained for the SNP markers (PIC = 0.18).
We also compared the genetic relationships among the 20 US cultivars inferred from the two sets of markers. The Rogers’ distance matrices based on SSR and SNP markers were moderately correlated (Mantel test R = 0.42, P < 0.0001), indicating that the two matrices contain common information. However, the intermediate R value indicates that the relationships among germplasm inferred from the two different marker systems can be quite different.
SNP polymorphism in wheat
The quantification of the SNP polymorphism among elite wheat cultivars is important to estimate the utility of SNP markers in commercial wheat breeding programs. The level of polymorphism determines the proportion of useful SNPs and, therefore, affects the cost of using SNPs to develop genetic maps, perform association studies, or use them in marker assisted selection.
The average PIC value of 0.18 found in our study indicates that, on average, one in five to one in six of the intron-derived SNPs are expected to be polymorphic between any two US common wheat cultivars. This value is considerably lower than the mean PIC value of 0.27 estimated by Somers et al. (2003), a discrepancy most likely due to the more diverse genotypes used in their study. Somers et al. (2003) included wheat cultivars from different countries and, more importantly, a synthetic wheat. Synthetic wheats are developed by hybridizing tetraploid wheat with Aegilops tauschii accessions, capturing greater diversity than the one currently present in the T. aestivum germplasm. The germplasm set in this study, by comparison, was comprised of elite cultivars adapted to different production regions in the US and included no synthetic wheats.
The average PIC values of 0.18 found with the complete SNP set increased to 0.23 when calculations were based only on the 253 SNPs pre-selected for polymorphisms among the diverse germplasm set. In contrast, the PIC values found using the 106 SNPs that were polymorphic only in synthetic and durum wheats in the SNP discovery project were less than one-third (PIC = 0.07) of the previous values. From these results, we conclude that studies using commercial wheat cultivars and a limited number of SNPs can benefit greatly from selecting only those SNPs that are polymorphic among the diverse germplasm dataset. That being said, studies limited by the number of available SNPs can certainly find some additional polymorphism within the SNPs showing polymorphisms in synthetic and tetraploid wheat; for within this set of 106 SNPs, 36 (34%) showed at least one polymorphism among the US cultivars.
Comparison between SNP and SSR diversity values
The SNP PIC values were found to be approximately three times lower than those based on SSR markers (Table 2). This agrees well with published results suggesting that for linkage studies approximately three times as many SNPs are needed in comparison to SSRs (Kruglyak 1997). SNPs are bi-allelic markers and, therefore, are limited to maximum PIC values of 0.5 (when both alleles have identical frequencies), whereas multi-allelic markers (e.g. SSR) do not have this limitation. In addition, nucleotide mutation rates in intronic regions of the wheat genome (average 5.5 × 10⁻⁹ substitutions nt⁻¹ year⁻¹, Dvorak and Akhunov 2005) are several orders of magnitude slower than the mutation rate of SSRs (2.4 × 10⁻⁴ repeats allele⁻¹ generation⁻¹, Thuillet et al. 2002). Finally, most of the SSRs used in the previous study (Chao et al. 2007) were derived from genomic regions, whereas the SNPs included in this study were selected exclusively from genic regions (introns), which tend to evolve more slowly than wheat intergenic regions (Dubcovsky and Dvorak 2007). Our previous study revealed significantly lower levels of polymorphism among EST-derived SSRs (~22%) than among genomic SSRs (>50%, Chao et al. 2007). A similar result has been reported in barley (Russell et al. 2004).
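The 0.5 ceiling for bi-allelic markers can be checked with a short derivation, assuming PIC is computed as one minus the sum of squared allele frequencies (the form consistent with the maximum quoted above). Here p is the frequency of one allele and k the number of alleles at a multi-allelic marker; both symbols are introduced only for this illustration.

```latex
\begin{align*}
\mathrm{PIC}(p) &= 1 - p^{2} - (1-p)^{2} = 2p(1-p), \\
\frac{d\,\mathrm{PIC}}{dp} &= 2 - 4p = 0 \;\Longrightarrow\; p = \tfrac{1}{2},
\qquad \mathrm{PIC}\!\left(\tfrac{1}{2}\right) = \tfrac{1}{2}, \\
\mathrm{PIC}_{\max}(k) &= 1 - \sum_{i=1}^{k} \left(\tfrac{1}{k}\right)^{2} = 1 - \tfrac{1}{k}.
\end{align*}
```

Under the same assumption, a marker with k equally frequent alleles reaches 1 - 1/k, which is why multi-allelic SSRs can exceed 0.5.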
The different mutation rates, as well as the different genome regions represented in the SNP and SSR datasets, may contribute to the low correlation observed between the genetic distance matrices for US cultivars derived from these different types of markers. The higher mutation rate of the SSR markers may also explain the less severe reduction in diversity observed in the D genome compared with the SNP data (Table 2).
A way to increase SNP diversity values is to combine the information of multiple SNPs for a single locus by haplotype analysis (Brumfield et al. 2003). A two-fold increase of haplotype diversity over individual SNPs was found in maize (Ching et al. 2002). Higher haplotype diversities were also observed in sugar beet (Schneider et al. 2007), soybean (Zhu et al. 2003), and barley (Russell et al. 2004). As shown in humans, haplotype analysis offers the advantage of capturing most of the genetic variation across a region, and a minimal set of SNPs can be selected and used to distinguish common haplotypes in a block (Cardon and Abecasis 2003). Knowledge of haplotype structure in genic regions will also help to assess the extent of LD across genes in cultivated wheat.
Low level of polymorphism in wheat D genome
In this study, we found a lower level of SNP polymorphism in markers located in the D genome relative to those located in the A and B genomes in both the US cultivars and the diverse germplasm set. Using the 253 pre-selected SNPs, the ratio of the average PIC value for the A and B genomes to that of the D genome (PIC_AB/PIC_D) was found to be 1.7 for the US cultivars and 1.4 for the diverse germplasm set (Table 3). These ratios increased to 1.9 and 1.7, respectively, when the complete set of 359 SNPs was considered (Table 2, data not shown), suggesting that the low PIC_AB/PIC_D value for the diverse germplasm set may be a result of the exclusion of non-polymorphic SNPs from this dataset. When the same calculation was made using a set of 1,228 genes from the NSF wheat SNP discovery project, including both polymorphic and non-polymorphic genes, the PIC_AB/PIC_D ratio for the diverse germplasm set increased to 2.2 (data not shown), a value more similar to the ones found in our study. These values suggest that the average PIC values for the A and B genomes are approximately two-fold higher than the PIC values for the D genome in hexaploid wheat.
The low level of diversity found in the D genome is expected from the evolutionary history of hexaploid wheat. Triticum aestivum originated less than 10,000 years ago from the hybridization of tetraploid wheat with a limited number of A. tauschii accessions (Dvorak et al. 1998; Talbert et al. 1998; Caldwell et al. 2004). After these few polyploidization events, limited gene flow occurred between A. tauschii and T. aestivum, whereas frequent hybridization and good fertility of the pentaploid hybrids allowed for continuous gene flow between T. aestivum and tetraploid wheat species, increasing the diversity of the A and B genomes relative to the D genome (reviewed in Dubcovsky and Dvorak 2007).
This study suggests that the selection of modern cultivars from the original hexaploid landraces resulted in an additional diversity bottleneck that had a stronger effect on the D genome than on the A and B genomes (Fig. 1, Table 3). Because it is unlikely that those differences arose by differential selection of genes located in a particular genome, the most likely explanation is a differential effect of genetic drift during the diversity bottleneck that accompanied the development of the modern adapted germplasm, caused by preexisting differences in the proportion of low-frequency alleles among genomes. The reduction of effective population size during the bottleneck imposed by selection in modern breeding programs increased the chance of genetic drift, which in turn increased the probability of losing low-frequency alleles from the population. Because the proportion of low-frequency alleles in the diverse germplasm set is higher in the D genome than in the A and B genomes, the D genome may have lost a larger share of its allelic variants (Fig. 1).
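The link between allele frequency and loss during a bottleneck can be illustrated with a deliberately simple sampling model, which is an assumption of this sketch and not an analysis performed in the study: if n inbred founder lines are drawn independently from a population in which an allele has frequency p, the allele is absent from the sample with probability (1 - p)^n. The function name is hypothetical.

```python
def loss_probability(allele_frequency, n_lines):
    """Probability that an allele is missing from n independently sampled
    inbred lines (simplified model: one allele copy drawn per line)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    if n_lines < 1:
        raise ValueError("at least one line must be sampled")
    return (1.0 - allele_frequency) ** n_lines


# Compare a rare allele with a common one for the same sample size.
rare = loss_probability(0.125, 20)
common = loss_probability(0.5, 20)
```

Under this model the rare allele is always the more likely one to be lost, so a genome carrying a larger share of rare alleles would be expected to lose more variants for the same bottleneck size.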
In summary, the characterization of wheat SNPs in commercial US cultivars is encouraging and suggests that SNP markers have adequate levels of polymorphisms to make them useful in genetic and breeding studies. This study indicates that additional SNP discovery efforts targeted to the D genome would likely be required to offset the lower diversity of this genome among the elite wheat cultivars. With increasing SNP discovery projects and the development of SNP assay technologies that can assay thousands of SNPs in parallel, it is anticipated that SNP markers will play an increasingly important role in wheat genetics and breeding applications.
This research was supported in part by funds from the U.S. Department of Agriculture, Cooperative State Research, Education and Extension Service, Coordinated Agricultural Project grant number 2006-55606-16629 and NSF Grant No. DBI-0321757. We thank Dr. Jan Dvorak for facilitating early access to the SNP data generated by the NSF project and for his useful suggestions and ideas, and Iago Lowe for his thorough revision of the manuscript.