Introduction

From the perspective of germplasm utilization, genetic diversity is viewed in terms of genes that can be utilized for the improvement of the cultivated species. In this sense, more emphasis is placed on genes related to environmental fitness and physiological processes than on the morphological characteristics important in taxonomic classification. Thus, it is imperative to take advantage of the available gene pools for cotton improvement to meet demands for enhanced crop productivity while simultaneously increasing the fitness of the crop in response to changing environmental factors. The domesticated and wild species of the Gossypium genus represent an enormous genetic storehouse for potential exploitation by cotton breeders and geneticists. However, genetic diversity within Gossypium germplasm collections remains underutilized: in the U.S., less than 1% of the available genetic diversity is exploited (Esbroeck and Bowman 1998; Wallace et al. 2009). This is due in part to the shortage of effective genomic tools that can discern and define the patterns of variability available to breeders and geneticists.

The Gossypium genus has about 50 species, including the four domesticated species (G. arboreum L., G. barbadense L., G. herbaceum L., and G. hirsutum L.), that embody considerable genetic diversity across a wide geographic range covering most tropical and subtropical regions of the world (Percival et al. 1999; Wendel and Cronn 2003; Campbell et al. 2010). Based on pairing relationships of 13 chromosomes, the diploid Gossypium species are grouped into A through G and K genomes (Endrizzi et al. 1985; Stewart 1994). The allotetraploid species are the products of a presumed single polyploidization event between the A-genome and D-genome diploids that occurred around 1–2 million years ago (Wendel and Cronn 2003). The two diploid species G. arboreum and G. herbaceum (n = x = 13) of the A-genome are known as Old World Asiatic cottons, while the two allotetraploid species G. hirsutum and G. barbadense (n = 2x = 26) of the AD-genome are known as New World cottons (Lee 1984). The gametic chromosome number of the allotetraploid cotton species (26, with disomic pairing) is the largest among all major crop plant species (Yu and Kohel 2001; Stelly et al. 2005). Cultivated cotton is the world’s leading fiber crop and the second most important oilseed crop.

As DNA markers, genome maps, and other molecular tools are being developed for cotton, it becomes possible to look at the genotypes (not just phenotypes) of accessions of the Gossypium germplasm collections to identify redundancies and duplications, and to provide guidance for evaluation and selection. Many novel alleles can thus be discovered that are otherwise buried in the background noise of the genome, but may contribute to the well-being of the cotton plant (Tanksley and McCouch 1997). To make significant and timely advances in genetic improvement of cotton, portable DNA markers based on polymerase chain reaction (PCR) are needed for the tetraploid genome of cultivated cottons. Over the last decade, thousands of PCR-based DNA markers including simple sequence repeat (SSR) markers were identified and many were mapped in the cotton genome (Blenda et al. 2006; Guo et al. 2007; Lacape et al. 2009; Yu et al. 2011; Yu et al. 2012). These cotton SSR markers were developed either from random enriched small genomic clones, large insert bacterial artificial chromosome (BAC) clones (or physical contigs) or expressed sequence tags (ESTs) of Gossypium species. Ad hoc studies to characterize Gossypium germplasm were conducted that used various sets of mapped and/or unmapped SSR markers (Liu et al. 2006; Lacape et al. 2007; Abdurakhmonov et al. 2008, 2009). Subsets of different collections that represent the range of diversity of G. hirsutum and outliers of other Gossypium species were subjected to molecular characterization.

Over the course of such studies, an understanding of the germplasm structure emerged and valuable information was generated toward developing a core set of markers. However, comparative analyses and systematic characterizations of the genus were made difficult, and potential information was lost, because results could not be compared directly between studies and data sets could not be combined. Fragmented data and information on germplasm characterization should be coordinated and integrated as a whole. Coordination with the cotton research community would require a consensus on a common set of portable DNA markers (Kohel and Yu 2002). A strategy was proposed in which at least four markers would be necessary to cover both arms of each chromosome (Fig. 1). The objective of this study was to identify and validate an appropriate set of core SSR markers to serve as initial molecular descriptors to characterize Gossypium germplasm collections and as desirable genomic tools for other studies including marker-assisted selection in cotton.

Fig. 1

Schematic illustration of core DNA markers distributed on the 26 chromosomes of tetraploid cotton genomes (Kohel and Yu 2002)

Materials and methods

Plant materials and DNA extraction

Table 1 shows the 12 cultivated and exotic cotton genotypes representing six Gossypium species that were assembled and maintained as individual plants in a USDA-ARS greenhouse in College Station, Texas, USA. The 12 diverse genotypes provide the cotton research community with a manageable reference panel of core Gossypium germplasm, which serves as the standard for the cotton marker database (CMD) and other cotton genomic studies (Yu 2004; Blenda et al. 2006). The cotton genotype panel consists of the tetraploid genetic standards for G. hirsutum (upland inbred TM-1) and G. barbadense (double haploid Pima 3–79), which were also used as parents for a recombinant inbred line (RIL) population. In addition, the panel includes two wild tetraploid introgression species (G. tomentosum Nuttall ex Seemann and G. mustelinum Miers ex Watt). The panel also includes five commercially grown G. hirsutum cultivars, i.e., Acala Maxxa (PVP 9000168), Delta and Pine Land (DPL) 458BR (PVP 9800206), Paymaster (PM) 1218BR (PVP 200000213), Fibermax 832 (PVP 9800258) and Stoneville (STV) 4892BR (PVP 200000253), representing major cottonseed companies in the U.S.; one G. barbadense cultivar (Pima S-6); and two ancestral diploid genome representatives (G. arboreum, A2-8 and G. raimondii Ulbrich, D5-3). This panel represents a balanced diversity of Gossypium germplasm that is important in cotton cultivar improvement and well accepted by the cotton research community (Yu 2004; Blenda et al. 2006).

Table 1 A reference panel of Gossypium germplasm standardized for cotton marker database (CMD) (Yu 2004)

Genomic DNA was extracted from fresh young leaf tissue of individual cotton plants grown in the greenhouse following the modified CTAB DNA extraction procedure as described by Kohel et al. (2001) and Yu et al. (2012).

SSR primers and PCR assays

The SSR primer pairs used in this study were developed by different research groups of the cotton community (Blenda et al. 2006). A total of 105 pairs of primers were identified that represented ten diverse sources (BNL, CIR, DPL, GH, JESPR, MGHES, MUSB, NAU, STV, and TMB) of cotton SSR discovery efforts (Table 2). The selected SSRs exhibit minimal cross interaction among the primer pairs, minimal overlap of PCR products or fragment sizes, good binding capacity to amplify genomic DNA, sufficient power to detect DNA polymorphism of Gossypium species, and even distribution among the 26 chromosomes of the tetraploid cotton. Each chromosome was represented by four SSR loci except for the genetically longest chromosome (c05), which had five SSR loci on the basis of the genetic map developed from the G. hirsutum TM-1/G. barbadense 3–79 RIL population (Yu et al. 2012). The sequence of individual primer pairs and the source clone for each SSR can be found in the CMD (http://www.cottonmarker.org/).

Table 2 Source and number of cotton SSR primer pairs identified for the core marker set (http://www.cottonmarker.org/)

Each of the 105 individual forward primers was labeled at the 5′ end with one of the three fluorescent dyes (FAM, HEX, or NED) to allow the creation of 35 triplex PCR bin sets. Three pairs of primers with different dyes were multiplexed in each PCR according to Fang et al. (2010). All PCR experiments were conducted using the same protocol and temperature profile described by Fang et al. (2010). Amplified DNA products were separated by capillary electrophoresis on an automated Genetic Analyzer 3730xl (Applied Biosystems/Life Technologies, Foster City, CA, USA) with GeneScan-500 Rox® as an internal DNA size standard.
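The grouping arithmetic behind the triplex design (105 primer pairs, three dyes, 35 sets) can be sketched in a few lines of Python. This is only an illustrative sketch: the primer names are hypothetical placeholders, the function name build_triplex_sets is an assumption, and the grouping is purely sequential, whereas the actual bin sets also had to avoid overlapping fragment sizes and primer cross interaction.

```python
# Illustrative sketch only: groups primer pairs into triplex sets so that
# each set carries one primer pair per fluorescent dye. Primer names are
# hypothetical placeholders, not the actual core marker names.
DYES = ("FAM", "HEX", "NED")


def build_triplex_sets(primers, dyes=DYES):
    """Return a list of dicts mapping each dye to one primer pair."""
    if len(primers) % len(dyes) != 0:
        raise ValueError("primer count must be a multiple of the number of dyes")
    sets = []
    for start in range(0, len(primers), len(dyes)):
        group = primers[start:start + len(dyes)]
        # Within a set, every primer pair gets a different dye.
        sets.append(dict(zip(dyes, group)))
    return sets


# Hypothetical placeholder names SSR001 ... SSR105.
primers = [f"SSR{i:03d}" for i in range(1, 106)]
triplex_sets = build_triplex_sets(primers)
print(len(triplex_sets))   # number of triplex sets
print(triplex_sets[0])     # dye-to-primer assignment of the first set
```

In practice the assignment of markers to sets would be driven by expected fragment sizes, so that products sharing a capillary do not overlap within the same dye channel.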

Data analysis

Individual PCR fragments of each cotton accession were recorded in actual base pairs (bp) with the ABI GeneMapper 4.0 software. The polymorphism information content (PIC) value of each SSR marker was calculated according to Botstein et al. (1980) using the PowerMarker 3.0 program (Liu and Muse 2005). SSR products were scored as present or absent when calculating a similarity coefficient. The Jaccard coefficient was used to compute a similarity matrix using a server running the program DendroUPGMA (http://genomes.urv.es/UPGMA/) (Garcia-Vallvé et al. 1999). This matrix was used to construct a dendrogram using the program PhyloWidget (http://www.phylowidget.org/) (Jordan and Piel 2008).
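For readers unfamiliar with the PIC statistic, the formula of Botstein et al. (1980) is PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of the i-th allele at a locus. The following Python sketch evaluates that formula directly. It is not the PowerMarker implementation, and the function name and the example allele frequencies are assumptions chosen for illustration.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content (Botstein et al. 1980).

    PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    allele_freqs: allele frequencies at one locus, summing to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    cross_term = sum(2 * (p * p) * (q * q)
                     for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical allele frequencies, for illustration only.
print(pic([1.0]))                      # monomorphic locus
print(pic([0.5, 0.5]))                 # two equally frequent alleles
print(pic([0.25, 0.25, 0.25, 0.25]))   # four equally frequent alleles
```

A monomorphic locus gives a PIC of 0, and the value rises as alleles become more numerous and more evenly distributed.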

Results and discussion

Genome distribution of a core SSR marker set

A recently developed tetraploid genetic linkage map with a robust framework of SSR-linked loci (Yu et al. 2012) was used to identify the SSR markers used in this study. Four SSR markers were identified from each of the 26 cotton chromosomes, except for chromosome 5 (c05). Based on the genetic map of Yu et al. (2012), c05 was the longest in genetic distance and thus was represented with five SSR markers. Table 3 shows the chromosome location of each SSR marker as placed on cotton linkage groups/chromosomes by Yu et al. (2012). An interval of approximately 30 cM between SSRs was used as the criterion to ensure even distribution of the markers on each chromosome. Even though telomeric regions of plant chromosomes are primarily heterochromatic, they are important not only in maintaining the integrity of chromosome structure but also in harboring genes of interest (Gill et al. 1993; Fang et al. 2010). Six markers were selected from these telomeric regions, including TMB1648 on c04, JESPR119 on c06, JESPR291 on c08, TMB2295 on c18, BNL1047 and DPL0520 on c25 (Table 3). Other criteria to balance the marker distribution in the genome included single-copy versus multiple-copy markers and recombination-rich versus recombination-poor regions for potential tapping into gene islands that may be buried in the cotton genome (Xu et al. 2008).
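The spacing criterion can be illustrated with a simple greedy pass along one linkage group. The sketch below is only an approximation of the idea: the marker names and map positions are hypothetical, the function pick_spaced_markers is an assumption, and the actual selection also weighed telomeric location, copy number, and recombination rate as described above.

```python
def pick_spaced_markers(loci, spacing_cm=30.0, n_markers=4):
    """Greedily pick markers at least `spacing_cm` apart along one chromosome.

    loci: iterable of (marker_name, position_in_cM) tuples.
    Returns up to `n_markers` (name, position) tuples in map order.
    """
    chosen = []
    for name, pos in sorted(loci, key=lambda locus: locus[1]):
        # Accept the first marker, then any marker far enough from the last pick.
        if not chosen or pos - chosen[-1][1] >= spacing_cm:
            chosen.append((name, pos))
        if len(chosen) == n_markers:
            break
    return chosen


# Hypothetical markers and map positions (cM) on a single chromosome.
example_loci = [
    ("M1", 0.0), ("M2", 12.5), ("M3", 31.0), ("M4", 48.0),
    ("M5", 64.2), ("M6", 95.0), ("M7", 110.0),
]
selected = pick_spaced_markers(example_loci)
print([name for name, _ in selected])
```

A greedy pass like this guarantees a minimum gap but not an exact 30 cM interval, so a real selection would still need manual review against the other criteria.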

Table 3 Genomic location, PIC value and triplex set of cotton core SSR markers

Polymorphism information content (PIC) values of core SSR markers

Each of the SSR markers from the core set was examined for its PCR specificity and DNA polymorphism on the CMD panel of 12 core Gossypium genotypes (Table 1). Analysis of the CMD panel and other Gossypium germplasm accessions demonstrated the ability of the core markers to reveal genetic diversity within the collections (Abdurakhmonov et al. 2008). Table 3 shows the PIC values of each core SSR marker among all the 12 Gossypium genotypes and within the six G. hirsutum cultivars and accessions. As expected, the core SSR marker set detected higher levels of genetic diversity among the Gossypium species than within the G. hirsutum genotypes. PIC values of the core SSR marker set on the complete CMD panel ranged from 0.23 to 0.88, with an average value of 0.65. The majority (90) of the 105 core SSR markers were highly informative (PIC > 0.50) and 13 SSR markers were moderately informative (0.25 < PIC < 0.50), while only two were slightly informative (PIC < 0.25). PIC values of the core SSR marker set on the six G. hirsutum genotypes ranged from 0 to 0.84, with an average value of 0.29. Thirty-seven markers detected genetic diversity among the six Gossypium species but not within the six G. hirsutum cultivars (PIC = 0). Of the 68 markers that detected G. hirsutum diversity, 27 were highly informative (PIC > 0.50), 30 moderately informative (0.25 < PIC < 0.50) and 11 slightly informative (PIC < 0.25). These results can guide researchers when they plan individual studies with this core SSR marker set. For example, SSR markers such as BNL3545 on c02 and BNL1531 on c16 had very high PIC values not only among the six Gossypium species (0.88 and 0.76, respectively) but also within G. hirsutum (0.84 and 0.77, respectively). Such highly informative SSR markers usually resulted from multiple copies of PCR products, and would be of particular interest in identification of unique DNA profiles and potentially novel genes within the cultivated cottons (G. hirsutum) (Botstein et al. 1980; Tanksley and McCouch 1997).
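The informativeness classes used above can be expressed as a small helper, which also makes it easy to check that the reported class counts add up to the 105 markers. This is a sketch under stated assumptions: the function name is hypothetical, and because the text does not say how a PIC of exactly 0.25 or 0.50 is classed, the boundary handling below is an assumption.

```python
def classify_pic(pic):
    """Assign a PIC value to the informativeness classes used in the text.

    Boundary handling (exactly 0.25 or 0.50 counted as "moderately
    informative") is an assumption; the text leaves it unspecified.
    """
    if not 0.0 <= pic <= 1.0:
        raise ValueError("PIC must lie between 0 and 1")
    if pic == 0.0:
        return "uninformative"
    if pic > 0.50:
        return "highly informative"
    if pic >= 0.25:
        return "moderately informative"
    return "slightly informative"


# PIC values quoted in the text: (CMD panel, within G. hirsutum).
reported = {"BNL3545": (0.88, 0.84), "BNL1531": (0.76, 0.77)}
for marker, (panel_pic, hirsutum_pic) in reported.items():
    print(marker, classify_pic(panel_pic), classify_pic(hirsutum_pic))

# Class counts as reported in the text; both should total 105 markers.
panel_counts = {"highly informative": 90, "moderately informative": 13,
                "slightly informative": 2}
hirsutum_counts = {"uninformative": 37, "highly informative": 27,
                   "moderately informative": 30, "slightly informative": 11}
print(sum(panel_counts.values()), sum(hirsutum_counts.values()))
```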

We compared our results with those obtained by Lacape et al. (2007). They used 184 SSR markers to analyze 47 Gossypium genotypes that included four tetraploid species. Of the 184 markers used in their study, 18 were also included in our core set (Table 3). In their study, the average PIC value of these 18 markers was 0.67 with the range from 0.34 (marker BNL2544) to 0.86 (JESPR153a). In our present research, the average PIC value of the same 18 markers was 0.66 with the range from 0.39 (marker BNL2544) to 0.82 (JESPR153). The results of these two studies were similar although Lacape et al. (2007) did not calculate the PIC values of each marker among G. hirsutum accessions.

Phylogenetic relationship of the CMD panel as revealed by the core SSR marker set

The assembly of the 12-genotype CMD panel has provided the cotton research community with a Gossypium germplasm reference (Yu 2004; Blenda et al. 2006). However, knowledge about the phylogenetic relationship of the 12 cotton genotypes is needed to better interpret research results with this genetic reference set. Using the 105 core SSR marker set, a matrix was generated to reveal the genetic similarity between any two members of the CMD panel (Supplemental Table 1). A dendrogram was constructed to display the genetic similarities within the CMD panel (Fig. 2). All six G. hirsutum genotypes (AD1) clustered in one major clade. Among the G. hirsutum cultivars, DPL 458BR and STV 4892BR were the most similar (0.755), while Acala Maxxa and PM 1218BR were the most different (0.575). TM-1, the G. hirsutum genetic standard, is an inbred derived from DPL 14, a highly productive and widely adapted cultivar (Kohel et al. 1970). Figure 2 shows that TM-1 was located close to the branch of DPL 458BR and STV 4892BR, which had similarity coefficients with TM-1 of 0.683 and 0.724, respectively. Two G. barbadense cottons (AD2), the genetic standard 3–79 and cultivar Pima S-6, were very closely related to each other with a similarity coefficient of 0.855. Two diploid species representatives, G. arboreum A2-8 and G. raimondii D5-3, were the probable progenitors of the A and D subgenomes which formed the tetraploid cottons (Wendel and Cronn 2003). However, G. raimondii D5-3 (similarity coefficients with the tetraploid cottons all under 0.1) exhibited a more distant relationship than G. arboreum A2-8 (similarity coefficients with the tetraploid cottons all above 0.1) to all other CMD panel members (Supplemental Table 1). Two wild tetraploid species, G. tomentosum (AD3) and G. mustelinum (AD4), also were clearly distinguishable in forming their own branches of the dendrogram. Results from this analysis reflect the original considerations of the CMD panel assembly and confirm the value of the 105 core SSR marker set for use in cotton research.
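The similarity coefficients discussed above are Jaccard coefficients computed on presence/absence scores of SSR fragments: the number of fragments shared by two genotypes divided by the number of fragments present in either. The following Python sketch shows that calculation for a pairwise matrix. It is not the DendroUPGMA implementation, and the genotype labels, marker names, and fragment sizes are hypothetical.

```python
from itertools import combinations


def jaccard(bands_a, bands_b):
    """Jaccard similarity: shared bands / bands present in either genotype."""
    a, b = set(bands_a), set(bands_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty profiles are scored as 0
    return len(a & b) / len(union)


def similarity_matrix(profiles):
    """Pairwise Jaccard similarities keyed by (genotype_x, genotype_y)."""
    return {(x, y): jaccard(profiles[x], profiles[y])
            for x, y in combinations(sorted(profiles), 2)}


# Hypothetical band profiles: each band is (marker, fragment size in bp).
profiles = {
    "genotype_A": {("SSR1", 150), ("SSR1", 162), ("SSR2", 201)},
    "genotype_B": {("SSR1", 150), ("SSR2", 201), ("SSR2", 210)},
    "genotype_C": {("SSR1", 170), ("SSR2", 230)},
}
sim = similarity_matrix(profiles)
for pair, value in sim.items():
    print(pair, round(value, 3))
```

A distance matrix for clustering can then be taken as 1 minus each similarity, which is the usual input to a UPGMA routine.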

Fig. 2

Unrooted dendrogram for 12 Gossypium genotypes of the CMD panel constructed with the 105 core SSR markers using the UPGMA algorithm (Jordan and Piel 2008)

Future implications of core DNA markers

The Gossypium genus comprises species with differing ploidy levels and presents a high degree of variability, from highly improved allotetraploid species to wild diploid forms. Research utilizing Gossypium germplasm is essential, as this is a complex genus. Cotton improvement programs, either public or private, use Gossypium germplasm in their programs with the ultimate goal of developing breeding lines or cultivars to increase lint yield and fiber quality among other traits. The variability in Gossypium germplasm has only recently begun to be tapped as a source of beneficial characteristics because there is still a shortage of effective genomic tools compared with other major crops. This report represents a major effort to improve the accuracy and efficiency of molecular characterization of Gossypium germplasm. This core SSR marker set, representing the efforts of many research groups, is portable, and thus is readily available to any researchers who wish to use it in their own research facilities.

Although additional portable markers are needed for detailed characterization, all evaluations of cotton germplasm across different gene pools and germplasm sources can be shared and pooled into a single database for systematic analysis with the 105 core SSR marker set presented in this report. Cotton germplasm accessions with unique DNA profiles can be investigated in detail with additional DNA markers for mining new genes of interest. Future augmentation of this core SSR set can be made with newly developed single nucleotide polymorphism (SNP) markers upon appropriate characterization and mapping in the cotton genomes (Van Deynze et al. 2009). The development of an appropriate set of portable core DNA markers that are highly informative, evenly distributed in the genome, able to generate unambiguous PCR products, and able to be effectively multiplexed offers a previously unavailable opportunity to integrate molecular characterization of the U.S. cotton germplasm collection and other Gossypium collections around the world into a single cohesive international effort to assess Gossypium diversity.