Development and mapping of SNP assays in allotetraploid cotton
- Cite this article as:
- Byers, R.L., Harker, D.B., Yourstone, S.M. et al. Theor Appl Genet (2012) 124: 1201. doi:10.1007/s00122-011-1780-8
A narrow germplasm base and a complex allotetraploid genome have made the discovery of single nucleotide polymorphism (SNP) markers difficult in cotton (Gossypium hirsutum). To generate sequence for SNP discovery, we conducted a genome reduction experiment (EcoRI, BfaI double digest, followed by adapter ligation, biotin–streptavidin purification, and agarose gel separation) on two accessions of G. hirsutum and two accessions of G. barbadense. From the genome reduction experiment, a total of 2.04 million genomic sequence reads were assembled into contigs with an N50 of 508 bp and analyzed for SNPs. A previously generated assembly of expressed sequence tags (ESTs) provided an additional source for SNP discovery. Using highly conservative parameters (minimum coverage of 8× at each SNP and 20% minor allele frequency), a total of 11,834 and 1,679 non-genic SNPs were identified between accessions of G. hirsutum and G. barbadense in genome reduction assemblies, respectively. An additional 4,327 genic SNPs were also identified between accessions of G. hirsutum in the EST assembly. KBioscience KASPar assays were designed for a portion of the intra-specific G. hirsutum SNPs. From 704 non-genic and 348 genic markers developed, a total of 367 (267 non-genic, 100 genic) mapped in a segregating F2 population (Acala Maxxa × TX2094) using the Fluidigm EP1 system. A G. hirsutum genetic linkage map of 1,688 cM was constructed based entirely on these new SNP markers. Of the genic-based SNPs, we were able to identify within which genome (‘A’ or ‘D’) each SNP resided using diploid species sequence data. Genetic maps generated by these newly identified markers are being used to locate quantitative, economically important regions within the cotton genome.
High throughput DNA sequencing technology facilitates the rapid discovery of large numbers of single nucleotide polymorphism (SNP) markers at relatively low cost compared to other traditional approaches. Recently, a few different strategies employing high throughput sequencing have reported identifying large numbers of SNP markers (Barbazuk et al. 2007; Van Tassell et al. 2008), including some in organisms with little previous molecular research (Maughan et al. 2010; Bundock et al. 2009; Baird et al. 2008) as well as organisms with little genetic variation such as cotton (Udall et al. 2006). These strategies utilize transcriptome sequencing, gene-enriched sequencing using methylation sensitive digestion, or sequencing of reduced representational libraries (RRL), also termed genome reduction.
Some of these strategies target genic regions while others target both genic and non-genic sequences. Genic strategies primarily target transcribed sequences through the development of Expressed Sequence Tags (ESTs). Expressed sequences may contain limited amounts of SNPs due to purifying selection of genic regions. In addition, the location of SNPs discovered in ESTs is limited to the transcribed regions of the genome. Genic regions can be indirectly targeted and transcription biases can be avoided using methylation sensitive digestion of genomic DNA combined with sequencing. Gene-enriched sequencing using methylation sensitive digestion and sequencing of RRLs are similar techniques but vary in digestion specificity resulting in subtly different distributions of sequence types. RRLs of genomic DNA provide a method to isolate small, equivalent portions of the genome from two or more individuals regardless of gene expression or methylation state (Van Tassell et al. 2008; Wiedmann et al. 2008). Maughan et al. (2009) developed a genome reduction methodology that is based on restriction site conservation (GR-RSC) in which double-digested DNA is selectively amplified and size-selected on an agarose gel. GR-RSC libraries contain only a fraction of the entire genome and allow for identification of mostly non-genic SNPs while providing low sequence bias and fairly uniform genomic distribution (Maughan et al. 2009). This GR-RSC strategy could be used to identify non-genic molecular markers in the tetraploid cotton genome that are (1) well distributed throughout the genome, (2) variable between domesticated and undomesticated germplasm and (3) likely neutral with respect to natural (or artificial) selection.
Cotton is a major world agricultural crop, estimated at ~115 million bales (USDA 2011). In the United States, cotton fiber and seed by-product revenue accounts for an estimated five billion dollars annually (Wallace et al. 2009). Gossypium hirsutum (Upland cotton) and G. barbadense (Pima cotton) represent 96.7 and 3.3% of the total cotton fiber produced in the United States (USDA 2011). Both G. hirsutum and G. barbadense are allotetraploid (2n = 4x = 52) species of cotton and are composed of an AT (~1,700 Mb) and a DT (~900 Mb) genome. Polyploidy combined with a narrow germplasm base has hindered the development of SNP-based marker assays in cotton. SNP-based molecular markers offer the possibility of constructing dense genetic maps as well as facilitating map-based gene cloning efforts and haplotype-based association studies. In cotton, the most extensive work to date on SNP development reported the characterization and mapping of 270 SNPs based on EST sequencing (Van Deynze et al. 2009).
Here we report the discovery and application of thousands of SNPs in the allotetraploid genome of cotton. Our SNP discovery and application efforts were guided by four main objectives: (1) utilization of the GR-RSC methodology to identify the first large-scale set of SNP markers in cotton, (2) conversion of several hundred putative SNPs into functional SNP genotyping assays using KASPar genotyping chemistry, (3) evaluation of the utility of these functional SNPs across a broad panel of domesticated and wild cotton accessions, and (4) development of the first genetic linkage map of G. hirsutum based solely on SNP markers.
Materials and methods
Four accessions were used for marker discovery in cotton. These accessions represent domestic and wild accessions of two species of allotetraploid cotton: G. hirsutum (Acala Maxxa and TX2094) and G. barbadense (Pima-S6 and K101). These accessions were selected for their agricultural significance (Brubaker and Wendel 1994) and historical relevance with regard to previous studies (Hovav et al. 2008; Rapp et al. 2009). The allotetraploid genome of cotton contains two genomes, AT and DT where the ‘T’ subscript indicates the genome in a tetraploid nucleus (Wendel and Cronn 2003). SNPs identified in the EST dataset were based on sequence data from the same G. hirsutum accessions (Acala Maxxa and TX2094) and additional diploid sequences from G. arboreum (A2 genome) and G. raimondii (D5 genome).
Additional plant materials include an F2 population and a diversity panel of G. hirsutum. An F2 population of 174 individuals was derived from a cross of the G. hirsutum parents Acala Maxxa × TX2094. A diversity panel of 48 accessions was created to represent the extant genetic diversity within G. hirsutum (Wendel et al. 1992). This panel includes representative domesticated accessions from the Mississippi delta, High Plains, and Eastern and Western United States. A broad representation of landraces and wild accessions was included to evaluate introgression potential of SNP markers with exotic germplasm.
Separate DNA extractions were performed for all samples using freeze-dried leaf tissue. Extractions for GR-RSC sequencing and F2 genotyping were performed using a cetyltrimethylammonium bromide (CTAB) extraction procedure scaled for 1.7 mL extractions (Kidwell and Osborn 1992). DNA of the diversity panel was extracted using the Qiagen DNeasy kit (Qiagen, Valencia, CA). Extracted DNA was suspended in DNase-free water and quantified using a NanoDrop spectrophotometer (ND 1000, NanoDrop Technologies Inc., Montchanin, DE).
Genome complexity was reduced using the GR-RSC method as described by Maughan et al. (2009). Briefly, total genomic DNA was double digested to completion using the restriction endonucleases EcoRI and BfaI. BfaI and EcoRI site-specific adapters were ligated to the sticky ends of the digested fragments. EcoRI adapters included a biotin end, which allowed EcoRI-cut fragments (the less frequent cut site) to be selected using a biotin–streptavidin magnetic bead separation, reducing genomic complexity by about 90%. Resulting fragments were then PCR amplified with adapter-specific primers that also contained multiplex identifier (MID) barcodes to allow for sample multiplexing on the Roche 454 pyrosequencing platform. Genomes were further reduced by selecting genomic fragments from the PCR reaction in the range of 450–600 bp via agarose gel separation. Samples were sequenced using Titanium reagents on the Roche 454 Genome Sequencer FLX at the BYU DNA Sequencing Center. Separate genome reductions were performed for each accession in the GR-RSC experiment (Acala Maxxa, TX2094, Pima-S6, K101).
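The digestion and selection steps described above can be sketched in silico. The following is a minimal illustration, not the laboratory protocol or the authors' software: it cuts a sequence at EcoRI (G^AATTC) and BfaI (C^TAG) sites, keeps fragments that are flanked by two cut sites with at least one EcoRI end (mimicking the biotin selection), and applies a size window. The function names are hypothetical, and the size window here is applied to the genomic insert only, ignoring adapter and MID barcode lengths.

```python
import re

# Recognition sites and cut offsets (G^AATTC for EcoRI, C^TAG for BfaI).
ENZYMES = {"EcoRI": ("GAATTC", 1), "BfaI": ("CTAG", 1)}


def cut_positions(seq):
    """Return sorted (position, enzyme) pairs for every cut in seq."""
    cuts = []
    for name, (site, offset) in ENZYMES.items():
        # A lookahead is used so that overlapping recognition sites are all found.
        for m in re.finditer(f"(?={site})", seq):
            cuts.append((m.start() + offset, name))
    return sorted(cuts)


def reduced_fragments(seq, min_len=450, max_len=600):
    """Fragments between adjacent cuts with >= 1 EcoRI end, inside the size window.

    Simplifying assumption: terminal pieces of the input sequence (only one
    real cut site) are ignored, because both ends need an adapter for PCR.
    """
    cuts = cut_positions(seq)
    kept = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        has_ecori_end = "EcoRI" in (left, right)
        if has_ecori_end and min_len <= end - start <= max_len:
            kept.append((start, end, left, right))
    return kept
```

Calling `reduced_fragments(sequence)` with the default window mirrors the 450–600 bp gel selection quoted in the text; a smaller window is convenient for toy sequences.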
Genomic fragment assembly
GR-RSC sequence reads were grouped into separate files based on their MID barcodes. Newbler de novo assembler v2.3 was used to create all of the GR-RSC sequence assemblies. Stringent assembly parameters of 97% sequence identity and 100 bp minimum overlap were used to minimize co-assembly of AT and DT homoeologous sequences. Combined GR-RSC assemblies were created to identify SNPs within G. hirsutum (between Acala Maxxa and TX2094), within G. barbadense (between Pima-S6 and K101), and between the two species (G. hirsutum and G. barbadense). Since less sequencing was performed on the G. barbadense accessions, a subset of G. hirsutum reads, referred to as “reduced G. hirsutum” hereafter, was created and used to form a fourth combined assembly. This “reduced G. hirsutum” assembly consisted of a random subset of reads, comparable both in number of reads and total bases to the G. barbadense assembly. The reduced assembly eliminated assembly size bias and allowed for direct comparison of results between the G. hirsutum and G. barbadense assemblies. Separate assemblies were also created for each of the four accessions.
To remove repetitive sequences from our analysis, the combined assemblies were run through RepeatMasker (Smit et al. 1996–2010). Categorized repeats for Gossypium (Grover, personal communication) were included along with the Arabidopsis repeats in our RepeatMasker database. All contigs that contained repetitive fragments were excluded from the assemblies used for SNP discovery. Prior to SNP analysis, contigs were also screened by reference mapping their consensus sequences to the chloroplast genomes of G. hirsutum and G. barbadense and to the mitochondrial genome of Arabidopsis using 454 gsMapper (v2.3).
Identification of SNPs and microsatellites
Potential SSRs were identified in assemblies of G. hirsutum and G. barbadense using MISA v.1.0 (http://pgrc.ipk-gatersleben.de/misa) with unit size/minimum number of repeats thresholds of 2/6, 3/5, 4/5, 5/5 and 6/5, and a maximum of 100 bases interrupting two SSRs in a compound microsatellite. Mono-repeats were not reported because 454 homopolymer sequencing errors would be confounded with SSR loci.
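The unit size/minimum repeat thresholds above can be illustrated with a short regular-expression search. This is a simplified sketch and not the MISA script itself: it reports only perfect repeats, does not merge compound microsatellites, and skips units that are themselves repeats of a shorter motif (so homopolymer runs are never reported). The function names are hypothetical.

```python
import re

# (unit size: minimum repeats) pairs quoted in the text: 2/6, 3/5, 4/5, 5/5, 6/5.
THRESHOLDS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'ATAT')."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, unit, repeat_count) tuples for perfect SSRs in seq."""
    hits = []
    for size, min_reps in THRESHOLDS.items():
        # One captured unit followed by at least (min_reps - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_reps - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):
                hits.append((m.start(), unit, len(m.group(0)) // size))
    return sorted(hits)
```

Restricting the search to primitive units is one simple way to keep a dinucleotide run from being counted again as a tetra- or hexanucleotide repeat.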
SNP assay design
The KASPar (KBioscience Ltd., Hoddesdon, UK) assay was used to convert a portion of identified SNPs and estimate a conversion rate of putative SNPs to functional assays. KASPar assays were developed to target 1,052 genome-specific SNPs identified between accessions of G. hirsutum (Acala Maxxa and TX2094; Supplemental Table 1). All assay primer sets were designed using PrimerPicker (KBioscience 2009) with default parameters. Of the 1,052 assays, 704 were designed to target SNPs from the GR-RSC G. hirsutum assembly while the remaining 348 were designed to target G. hirsutum SNPs located in EST sequences.
Because diploid sequence data from related species existed, two different strategies were employed in the development of the 348 EST SNP assays. In the first strategy, 192 of the assays were intended to amplify a single locus in a single genome with coincidental amplification of the non-target genome as background ‘noise’. In many of these SNP assays the resident genome was identified using diploid sequence information, hereafter referred to as genome-identified (GI) SNP assays. In contigs where homoeologs co-assembled, diploid sequence data were used as a reference to categorize tetraploid reads by genome (AT or DT) as indicated by genome distinguishing SNPs (polymorphisms which differed between genomes, but were identical between accessions) occurring in the same tetraploid read. Based on this categorization, the base identity of the minor allele identified the genome of the SNP assay (e.g. the major allele was found in both AT reads and DT reads but the minor allele was only found in DT reads, thus the resident genome of the SNP was DT). In contigs where homoeologs separately assembled, co-assembly of diploid reads identified the resident genome of the SNP (e.g. only A2 reads resided in the contig, thus the resident genome of the contig and SNP was AT). While only 192 assays were designed using this strategy, the putative genome for many thousands of SNPs was identified in the EST assembly (Flagel et al. 2011).
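The read categorization described above can be sketched as a simple voting scheme. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: each read is represented as a hypothetical mapping from contig position to observed base, and genome-distinguishing sites are given as (A2 base, D5 base) pairs taken from the diploid sequences.

```python
from collections import Counter


def assign_read(read, diagnostic_sites):
    """Assign one tetraploid read to 'AT', 'DT', or None (ambiguous).

    read: dict of contig position -> base observed in the read
    diagnostic_sites: dict of position -> (A2 base, D5 base) at
        genome-distinguishing sites
    """
    votes = Counter()
    for pos, (a2_base, d5_base) in diagnostic_sites.items():
        base = read.get(pos)
        if base == a2_base:
            votes["AT"] += 1
        elif base == d5_base:
            votes["DT"] += 1
    if votes["AT"] == votes["DT"]:
        return None  # no diagnostic site covered, or conflicting evidence
    return "AT" if votes["AT"] > votes["DT"] else "DT"


def resident_genome(reads, snp_pos, minor_allele, diagnostic_sites):
    """Genome whose reads carry the minor allele, or None if not unique."""
    genomes = {
        assign_read(r, diagnostic_sites)
        for r in reads
        if r.get(snp_pos) == minor_allele
    }
    genomes.discard(None)
    return genomes.pop() if len(genomes) == 1 else None
```

In this sketch, a minor allele seen only in reads assigned to DT yields `"DT"`, matching the example given in the text, while an allele present in reads from both genomes yields `None`.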
Subsequent genetic mapping of SNP assays from both design methods determined how accurately a single locus in a single genome could be targeted. Agreement of multiple predictive markers in linkage (e.g. five linked, targeted assays, all of which predict the AT genome) was used as an indication of success.
Genotyping and genetic mapping
Assay screening and genotyping were performed on two different platforms. Initially, a small set of genomic SNPs (20) was validated using traditional KASPar with a 384-well plate reader. Subsequent, large-scale screening and genotyping of SNPs were then performed on Fluidigm 96.96 Dynamic Arrays using the genotyping EP1 System (San Francisco, CA). Fluorescence intensity was measured with the PHERAstar plus (BMG LABTECH, Durham, NC) microplate reader or the EP1 (Fluidigm Corp, San Francisco, CA) reader and plotted in two axes. Genotypic calls from PHERAstar measurements were made in KlusterCaller (KBioscience Ltd., Hoddesdon, UK) while genotypes based on EP1 measurements were made using the Fluidigm SNP Genotyping Analysis (Fluidigm 2011) program.
All functional SNP assays were used to genotype the F2 population and 277 co-dominant assays between Acala Maxxa and TX2094 were used to genotype the 48 accessions of the G. hirsutum diversity panel. All genotype calls were manually checked for accuracy and ambiguous data points that failed to cluster were scored as missing data. A genetic map was constructed using regression mapping in JoinMap4 (Van Ooijen 2006). Markers which had greater than 30% of their genotypic data missing were excluded during the mapping process. A minimum LOD threshold of 5.0 was used and linkage distances were corrected using the Kosambi mapping function.
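Two of the mapping steps above lend themselves to a short illustration: the exclusion of markers with more than 30% missing calls, and the Kosambi mapping function, which converts a recombination fraction r into map distance as d = 25 ln((1 + 2r)/(1 − 2r)) cM. The sketch below is illustrative only; JoinMap performs these calculations internally, and the function names and the use of `None` for a missing call are assumptions.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance in centimorgans for recombination fraction r."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def passes_missing_filter(calls, max_missing=0.30):
    """True if the fraction of missing calls (None) does not exceed max_missing."""
    missing = sum(c is None for c in calls)
    return missing / len(calls) <= max_missing
```

For small r the Kosambi distance is close to 100r cM, and it grows faster than r as r approaches 0.5, reflecting undetected double crossovers.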
Sequencing and assembly of GR-RSC reads
Table 1 Summary of GR-RSC sequence assemblies. Columns: assembled bases (Mb) and assembly length (Mb). Rows: G. hirsutum assembly (Acala Maxxa, TX2094); reduced G. hirsutum assembly (Acala Maxxa, TX2094); G. barbadense assembly; Maxxa, TX2094, S6, K101.
The sequence data were assembled to form multiple GR-RSC assemblies (Table 1). The G. hirsutum assembly (Acala Maxxa and TX2094) resulted in 79,953 contigs with an N50 contig length of 516 bp while the G. barbadense assembly (Pima-S6 and K101) resulted in 51,307 contigs with an N50 contig length of 491 bp. Comparing the G. barbadense assembly with the results of the reduced G. hirsutum assembly, the reduced G. hirsutum assembly formed slightly more contigs (55,160) with an N50 contig length of 513 bp. The combined, inter-specific assembly (G. hirsutum vs. G. barbadense) resulted in 112,506 contigs from 1.25 million reads with an N50 contig length of 508 bp. The percentage of bases that assembled ranged from 51.1% in the reduced G. hirsutum assembly to 59.9% in the inter-specific assembly. Assemblies with a greater number of input reads had greater percentages of bases incorporated into their alignments. Read depth between the two accessions within an assembly was compared and most contigs in the combined assembly were found to contain reads from both accessions (e.g. G. hirsutum assembly, Supplemental Fig. 1), suggesting that the genome reduction was successful in isolating homologous regions from the sampled accessions.
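For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the usual definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical and the toy lengths are not from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")
```

With lengths 6, 5, 4, 3 and 2 (total 20), the running sum reaches 10 at the second contig, so the N50 is 5.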
GR-RSC SNP discovery
Table 2 Summary of GR-RSC SNP discovery. Columns: contigs with SNPs and SNPs per contig. Comparisons listed: Acala Maxxa and TX2094; reduced G. hirsutum (Acala Maxxa and TX2094); Pima-S6 and K101; (Maxxa, TX2094) and (S6, K101); Maxxa, TX2094, S6, K101.
The SNP frequency, calculated as the number of SNPs in an assembly divided by the length of the assembly, ranged from 0.0001 in the intra-specific assembly of G. barbadense to 0.00067 in the inter-specific assembly of G. hirsutum and G. barbadense (Table 2). These observed frequencies were not unexpected and reflect the narrower genetic base of the intra-specific G. barbadense comparison and the higher genetic diversity of the inter-specific comparison. We note that the frequencies reported here are most likely underestimates due to the conservative nature of the SNP identification parameters.
Transition mutations (A ⇔ G, or T ⇔ C) are defined as a change from a purine to a purine or a pyrimidine to a pyrimidine, while transversion mutations (e.g. A ⇔ T, A ⇔ C, G ⇔ T, G ⇔ C) are defined as a change from a purine to a pyrimidine or a pyrimidine to a purine. Nucleotide transitions naturally account for the majority of observed SNPs and are thought to be driven by hypermutability effects of CpG di-nucleotide sites or deamination of methyl cytosine and entropy constraints (Li 1997). In all four combined GR-RSC assemblies, transitions were the most common SNP type, with transition-to-transversion ratios of 2.3:1. These ratios are similar to those recently found in human, maize, and amaranth (Maughan et al. 2009; Morton et al. 2006; Zhang and Zhao 2004).
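The transition/transversion definitions above translate directly into a small classifier. The sketch below is illustrative: the function names are hypothetical and the SNPs are assumed to be supplied as pairs of bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    if a == b:
        raise ValueError("a SNP needs two different bases")
    pair = {a, b}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Transition-to-transversion ratio for an iterable of (base, base) pairs."""
    snps = list(snps)
    transitions = sum(is_transition(a, b) for a, b in snps)
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")
```

A ratio of 2.3:1 as reported in the text would correspond to roughly 70% of SNPs being transitions.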
EST SNP discovery
A de novo assembly of ESTs that included Acala Maxxa, TX2094, G. arboreum (A2 genome) and G. raimondii (D5 genome) sequences provided a basis for SNP discovery in coding regions (Flagel et al. 2011). The joint assembly of diploid (A2 and D5) and tetraploid ESTs allowed for identification of genome-specific SNPs in contigs of both separate and co-assembled homoeologs. A total of 3,319 SNPs were identified between Acala Maxxa and TX2094 in contigs where homoeologs did not co-assemble. In contigs of co-assembled homoeologs, 1,009 SNPs were identified between Acala Maxxa and TX2094.
SNP assay development
The EST-based assay conversion rates were similar to the GR-RSC assay conversion rate. Of the two types of EST SNP assays, 156 genome-targeted (GT) SNP assays and 192 genome-identified (GI) SNP assays, 50 (32.1%) and 59 (30.7%), respectively, met a χ² test for 1:2:1 or 3:1 segregation. Of the remaining 691 assays which did not segregate as expected for an F2 population, the vast majority (86%) failed to amplify or separate into clusters while the remainder (14%) formed clusters that did not conform to a 1:2:1 or 3:1 Mendelian pattern of inheritance, though they were used for genetic mapping (below). Some of these non-conforming assays may actually represent functional SNP assays that are simply linked to strongly skewed genomic regions (segregation distortion) in this F2 population. Skewness of molecular markers has been attributed to chromosomal regions containing possible gametophytic or zygotic viability factors (Lu et al. 2002; Zamir and Tadmor 1986) and/or underlying genetic factors (i.e., quantitative trait loci) conferring a selective advantage for the particular growing conditions used to produce the mapping population.
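The segregation test mentioned above can be illustrated with a plain goodness-of-fit calculation. This sketch makes an assumption the text does not state: a significance level of 0.05, for which the chi-square critical values are 3.841 (1 degree of freedom, 3:1 ratio) and 5.991 (2 degrees of freedom, 1:2:1 ratio). The function names and the example counts are hypothetical.

```python
# Chi-square critical values at alpha = 0.05 (assumed significance level).
CRITICAL_0_05 = {1: 3.841, 2: 5.991}


def chi_square(observed, ratio):
    """Goodness-of-fit statistic for observed class counts against a ratio."""
    total = sum(observed)
    parts = sum(ratio)
    expected = [total * r / parts for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_ratio(observed, ratio):
    """True if the counts are consistent with the ratio at alpha = 0.05."""
    df = len(ratio) - 1
    return chi_square(observed, ratio) <= CRITICAL_0_05[df]
```

For a co-dominant assay scored on 174 F2 individuals, `fits_ratio((44, 86, 44), (1, 2, 1))` is true, whereas strongly skewed counts such as `(80, 80, 14)` are rejected.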
SNP assay utility
To characterize the applied potential of these SNP assays in cotton breeding, the SNP assays were screened in a panel of 48 diverse G. hirsutum accessions (Supplemental Table 2). Several observations can be made from the observed genotypic patterns. First, of the 48 accessions genotyped no two individuals shared the same genotype across all assays (277 co-dominant SNP assays). Second, several accessions shared many wild alleles with TX2094 (the wild parent of the F2 mapping population), with the most similar individual, TX2090, sharing 80.0% of its alleles with TX2094. Third, comparison of domesticated accessions to Acala Maxxa confirmed that domesticated accessions had nearly all alleles in common with Acala Maxxa (of all domesticated accessions genotyped, no individual had more than 6.14% of its alleles different from Acala Maxxa and when considering all domesticated accessions together, only 17.7% of the 277 assays exhibited any TX2094 allele). An average heterozygosity of 2.43% was observed across all SNP assays with the highest heterozygosity of any assay being 15.2%. Of the 277 assays tested, 259 (93.5%) had a minor allele frequency of greater than 10% and 188 (67.9%) had a minor allele frequency of greater than 20%.
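The per-assay summary statistics reported above (minor allele frequency and heterozygosity) can be computed from co-dominant genotype calls as in the following sketch. The genotype encoding ("AA", "AB", "BB", with `None` for a missing call) and the function name are assumptions made for illustration.

```python
def allele_stats(calls):
    """Return (minor allele frequency, heterozygosity) for one SNP assay.

    calls: list of genotype calls, each 'AA', 'AB', 'BB', or None if missing.
    """
    scored = [c for c in calls if c is not None]
    if not scored:
        raise ValueError("no scored genotypes")
    b_alleles = sum(c.count("B") for c in scored)
    freq_b = b_alleles / (2 * len(scored))
    maf = min(freq_b, 1 - freq_b)
    heterozygosity = sum(c == "AB" for c in scored) / len(scored)
    return maf, heterozygosity
```

Missing calls are excluded from both denominators, so an assay is summarized only over the accessions that clustered.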
The results of the GT and GI SNP assays in the diversity panel of G. hirsutum were further inspected. The 277 assays screened included 25 A-genome assays and 23 D-genome assays. Across all accessions, A-genome assays identified 34.0% wild alleles and the D-genome assays identified 34.7% wild alleles. In the domesticated accessions, A-genome assays identified 7.1% wild alleles and D-genome assays identified 3.0% wild alleles. These results suggest that wild alleles are equally represented in both A- and D-genomes across the panel of other landraces and primitive cultivars. These assays also suggested a slight bias of wild alleles in the A-genome of cultivated cotton compared to the D-genome, though the limited number of assays detecting any wild alleles (9 A-genome and 7 D-genome assays total) in cultivated cotton prevented any broader assertions.
Genetic mapping of SNP assays
The resident genome of most EST SNP assays was identified a priori (GI) or was identified a priori and targeted (GT) during assay development. Of the 348 EST SNP assays, 100 were placed in the genetic map, and 81 of these had an a priori identification of their resident genome. Of these 81 assays, at least one was found in 32 (84%) of the 38 linkage groups, while at least two were found in 25 (66%) of the 38 linkage groups (Fig. 7). Of the 81 assays, 74 (91%) resided in linkage groups with at least one other GI or GT assay. To determine whether the resident genome of these SNPs was accurately identified, linkage groups with multiple GI and GT SNP assays were examined for genome consensus. Seventy (94%) of the 74 assays that resided in linkage groups with at least one other GI/GT assay agreed with the consensus for the target genome. Of the 25 linkage groups with two or more GI/GT SNP assays, 21 (84%) perfectly agreed with their genome identification (Fig. 7). Each of the four linkage groups with assays that disagreed contained only two GI/GT SNP assays. Thus, of the 38 linkage groups in the map, 28 (74%) can be putatively assigned a genome based on these predictive SNP assays. These assignments suggest that 12 linkage groups (#1, 2, 5, 6, 7, 9, 18, 19, 21, 23, 28, and 30) are representative of the DT genome while 16 (#3, 4, 10, 11, 12, 13, 14, 15, 16, 17, 22, 24, 25, 29, 36, and 38) are representative of the AT genome.
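The consensus check described above amounts to tallying predicted genomes within each linkage group. The sketch below is a hypothetical illustration with made-up inputs, not the authors' procedure: a group with a clear majority is assigned that genome, and a group with a tie (for example, two assays that disagree) is left unassigned.

```python
from collections import Counter, defaultdict


def group_consensus(assays):
    """Map each linkage group to its consensus genome, or None on a tie.

    assays: iterable of (linkage_group, predicted_genome) pairs, where the
        predicted genome is 'AT' or 'DT'.
    """
    by_group = defaultdict(Counter)
    for group, genome in assays:
        by_group[group][genome] += 1
    consensus = {}
    for group, counts in by_group.items():
        ranked = counts.most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        consensus[group] = None if tied else ranked[0][0]
    return consensus
```

Under this scheme, a linkage group carrying a single GI/GT assay is assigned that assay's genome, which is consistent with 28 of the 38 groups receiving a putative assignment in the text.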
In the GR-RSC sequence assemblies, potential SSR markers were identified using the MISA v.1.0 Perl script (http://pgrc.ipk-gatersleben.de/misa) (Supplemental Fig. 3). The AT/TA class was the most abundant, similar to SSR abundance in other species (Varshney et al. 2002). Di-nucleotide repeats were the most common, followed by tri-nucleotide repeats, and the frequency of each repeat decreased as repeat length increased. As expected, the number of detected repeats was also correlated with the size of the assembly. The assembly of both G. hirsutum and G. barbadense together contained the most SSRs, in part because it contained the highest number of reads. We report this discovery of additional SSR markers for cotton because SSRs continue to be broadly used in cotton research (Zhang et al. 2011; Gutiérrez et al. 2010; Lacape et al. 2009; Zhang et al. 2009; Lin et al. 2009; Rong et al. 2007).
SNP discovery and mapping
A narrow germplasm base coupled with the complexity of a tetraploid genome presented a significant challenge in identifying and developing functional SNP assays in cotton. Despite the difficulties, we successfully identified genome-specific SNP markers (validated by Mendelian segregation patterns in an F2 population) from both the GR-RSC and EST approaches and have shown that genome specificity (AT or DT) of EST SNP assays could be determined a priori via the inclusion of A2 and D5 diploid sequences in the EST assemblies. The SNPs identified in this study have a transition/transversion ratio similar to other plant genomes and we have shown that 361 of these SNPs exhibit the normal Mendelian inheritance expected in a segregating F2 population.
In addition to Mendelian segregation patterns, SNP assays based on these putative SNPs have been used to create an intra-specific map of G. hirsutum from a large segregating F2 population. The map covers 1,688 cM (37.5%) of the approximate 4,500 cM (Rong et al. 2004; Reinisch et al. 1994) recombination length of allotetraploid cotton. While this is not the largest intra-specific map to date in terms of cM, it is comparable to previous intra-specific maps (Zhang et al. 2009; Lin et al. 2009; Ulloa et al. 2002; Shen et al. 2005) and is the first map to be constructed in cotton exclusively with SNP-based markers. We have not attempted to associate linkage groups with specific chromosomes in this map, but the anticipated release of the diploid cotton genome sequences (G. arboreum and G. raimondii) within the next year should allow us to unambiguously assign SNP loci to particular chromosomes. Considering the conversion rate of putative GR-RSC SNPs to functional KASPar SNP assays (35.8%), we estimate that of the 11,834 SNPs we have identified within G. hirsutum in this study, 4,237 are expected to yield functional SNP assays. With additional assay development, these markers could provide the means to establish the first high-density linkage map of G. hirsutum based solely on SNP loci. SNP assays are an ideal marker choice as they represent the highest resolution molecular marker possible and are highly amenable to genotyping automation.
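The two estimates in the paragraph above follow from simple arithmetic on figures quoted in the text, as the short check below shows. The variable names are illustrative only.

```python
# Figures quoted in the text.
gr_rsc_conversion_rate = 0.358   # putative GR-RSC SNPs -> functional KASPar assays
putative_snps = 11834            # SNPs identified within G. hirsutum
map_length_cm = 1688             # length of the SNP linkage map
genome_length_cm = 4500          # approximate recombination length of the genome

# Expected number of functional assays if the same conversion rate holds.
expected_functional = round(putative_snps * gr_rsc_conversion_rate)

# Share of the recombination length covered by the map, in percent.
coverage_pct = round(100 * map_length_cm / genome_length_cm, 1)

print(expected_functional, coverage_pct)
```

The projection assumes that the untested SNPs convert at the same rate as the 704 GR-RSC SNPs that were assayed.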
Previous work suggests that GR-RSC markers are evenly distributed along chromosomes (Maughan et al. 2009). The even distribution of GR-RSC markers is of particular interest as it has recently become apparent that many agronomically important genes are controlled by regulatory sequences located in non-genic portions of the genome (Elshire et al. 2011). Thus, our development of SNP assays has targeted both genic and non-genic portions of the cotton genome. Specifically, GR-RSC SNP assays have been shown to also access pericentric and centromeric regions of the genome in Arabidopsis (Maughan et al. 2010). In maize, approximately 21% of genes lie in pericentric regions but most of the recombination occurs outside of these regions (Gore et al. 2009). If gene distribution within the cotton genome proves to be similar to maize, GR-RSC SNP assays may prove a valuable complement to previously identified molecular markers.
Homoeolog specific markers
We attempted to target alleles in only one of the two genomes resident in the tetraploid nucleus through two different methods of SNP assay design. The first and simplest method for identifying SNPs in a tetraploid is to force the separate assembly of homoeologs (i.e., only sequences from the AT genome assemble together and only sequences from the DT genome assemble together) through the utilization of strict assembly parameters. Loose assembly parameters (default) lead to co-assembly of homoeologous sequences that confound the identification of true SNPs. Neither strict nor loose assembly parameters produced ideal assemblies for all genome reduction fragments as the amount of sequence divergence in the selected fragments was locally constrained. The set of strict assembly parameters used 97% sequence identity and 100 bp minimum overlap to force sequences from each genome to assemble separately (i.e. genome-specific contigs). In addition to these parameters, our conservative SNP identification method (8× coverage, 90% identity and 20% minor allele frequency) only considered a subset of all SNPs in the dataset in which we had high confidence. Co-assembly of highly similar homoeologous sequences also likely occurred even in this strict assembly but this type of contig was ignored during SNP discovery in the GR-RSC assemblies. These contigs were ignored because accurate identification of SNPs without diploid reference sequences was impossible. Without the diploid reference sequences, we were unable to distinguish between a SNP in a co-assembly of homoeologs and a heterozygous locus in a separate assembly of homoeologous sequences (Fig. 2). Thus, only SNP loci were used that were homozygous in Acala Maxxa, homozygous in TX2094 and had a different nucleotide between the two accessions.
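The selection rule in the last sentence above, combined with the coverage and minor allele frequency thresholds, can be sketched as a per-position filter. This is an illustration under explicit assumptions rather than the authors' implementation: the 8× coverage is taken here as the combined read depth of both accessions, the minor allele frequency is computed from the combined reads, and the 90% identity criterion is not modeled. The function name and input format are hypothetical.

```python
def is_candidate_snp(counts_a, counts_b, min_coverage=8, min_maf=0.20):
    """Decide whether one contig position is a candidate SNP between accessions.

    counts_a, counts_b: dict of base -> number of reads showing that base at
        the position, one dict per accession (e.g. Acala Maxxa and TX2094).
    """
    # Ignore bases with zero reads so that only observed alleles are compared.
    counts_a = {b: n for b, n in counts_a.items() if n > 0}
    counts_b = {b: n for b, n in counts_b.items() if n > 0}
    # Each accession must show exactly one base (homozygous within accession).
    if len(counts_a) != 1 or len(counts_b) != 1:
        return False
    (base_a, n_a), = counts_a.items()
    (base_b, n_b), = counts_b.items()
    if base_a == base_b:
        return False  # no difference between the accessions
    total = n_a + n_b
    if total < min_coverage:
        return False
    return min(n_a, n_b) / total >= min_maf
```

Requiring a single base within each accession is what excludes positions where co-assembled homoeologs would otherwise look like a SNP.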
In contrast, the EST dataset provided sufficient A2 and D5 diploid sequence data to create genomic sequence references for each of the tetraploid genomes, thus allowing us to assign specific tetraploid reads to the AT or DT genome. Individual reads within a co-assembled tetraploid contig were assigned to either the AT or DT genome by genome-distinguishing SNPs matching bases in either the A2 or D5 diploid sequences (Fig. 2). Both the observance of expected Mendelian segregation ratios and the successful prediction of resident genomes (AT or DT) for greater than 94% of the GT/GI SNP assays support the conclusion that tetraploid reads were correctly assigned to genomes using A2 and D5 diploid reference sequences. In a few cases, designed GT/GI assays failed to indicate a consensus genome for their linkage group. Possible explanations for these disagreements include bioinformatic errors due to paralogous assemblies, differences between the diploid A2 and D5 genomes and the tetraploid AT and DT genomes, or poorly mapped linkage groups containing markers from both genomes. As far as we know, this is the first report of large-scale design of genome-specific SNP markers in a polyploid plant.
Consideration for SNP assay development and utilization in cotton
While bioinformatic filters can identify thousands of putative SNPs, often only a subset can be successfully converted to functional marker assays due to (1) simultaneous assay targeting of duplicate loci (paralogs or homoeologs), (2) local nucleotide limitations of primer design near the SNP, (3) proximity of the SNP to repetitive elements such as transposons, and (4) initial identification of false SNPs owing to sequencing errors and/or poor assembly. Our conversion rate of SNP assays was lower than initially anticipated. In amaranth, a diploid species, a conversion rate of nearly 70% was observed using a GR-RSC-based SNP discovery method (Maughan et al. 2009). The GR-RSC and EST SNP identification methods in this study had conversion rates of 35.8 and 31.3%, respectively. The difference between the amaranth and cotton conversion rates likely reflects the difference in ploidy level between the two species. In cotton, many of the ‘failed’ assays could be amplifying or partially amplifying segregating loci on both resident genomes, resulting in uninterpretable cluster patterns. In designing the EST SNP assays, 156 of the assays were specifically chosen at SNP loci where the flanking sequences had diverged between the AT and DT genomes. These SNP assays were developed to test whether a design of genome-specific primers could improve marker success rate. We observed similar conversion rates between the GR-RSC markers and both types of EST markers, suggesting that regardless of the source of the putative SNPs (EST or GR-RSC) or genome specificity of the KASP primers, only subtle improvement in SNP assay conversion rates may be achieved in a polyploid genome.
We characterized these SNP assays in a diverse germplasm panel of G. hirsutum to ascertain their broader utility for trait introgression via marker-assisted selection. Analysis of the germplasm panel with a selection of our SNP markers showed that Acala Maxxa and TX2094 were characteristic of domesticated and wild varieties, respectively, and that few wild alleles exist in cultivated varieties of cotton. It also demonstrated that the narrow germplasm base of cotton could be broadened dramatically via the introgression of wild alleles into the cultivated cotton germplasm. We expect the putative SNPs identified within G. barbadense (nearly 1,700) to possess similar utility in expanding the germplasm base of G. barbadense.
We report the discovery of over 151,000 putative SNPs in non-transcribed sequences of allotetraploid cotton. These polymorphisms were identified using a GR-RSC technique combined with 454 FLX high throughput sequencing. These SNPs represent both intra- and inter-specific SNPs identified in accessions of G. hirsutum and G. barbadense. We also identified 4,327 SNPs from a recent assembly of cotton ESTs. For many EST-based SNPs, we identified their resident genome (‘AT’ or ‘DT’) using diploid genome sequence data. Of these putative SNPs, we developed 1,052 KASPar-based SNP marker assays and evaluated the broad utility of 277 of them using a diverse panel of G. hirsutum accessions. Finally, we constructed the first genetic linkage map of G. hirsutum based entirely on 367 SNP markers. Hundreds of putative microsatellites were also identified.
We thank Cotton Incorporated, the National Science Foundation Plant Genome Program, and BYU Mentored Environment grants for their generous support. We thank Jonathan Wendel and Armel Salmon for construction of the cotton diversity panel and its corresponding DNA samples. We also thank undergraduate students Zach Liechty, Elisabeth Svedin, Prabin Bajgain, and Justin Page for their technical assistance.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.