Introduction

In total, 50 species of the genus Gossypium (Fryxell et al. 1992) represent eight different genomes, A through G and K (Endrizzi et al. 1984; Stewart 1995); of these, 45 are diploids (2n = 26) while five are allotetraploids (2n = 52). Four species, G. arboreum, G. herbaceum, G. hirsutum and G. barbadense, are cultivated in different parts of the world. Allotetraploids evolved about 1.1–1.9 million years ago by hybridization of an A-genome species with a D-genome species [the only sequenced genome (Paterson et al. 2012; Wang et al. 2012)], followed by genome duplication (Beasley 1940; Wendel et al. 1992). Among tetraploids, G. hirsutum is a widely distributed species comprising seven races: six domesticated (“marie-galante”, “punctatum”, “richmondi”, “morrillii”, “palmeri” and “latifolium”) and one wild (“yucatanense”) (Hutchinson 1951). The first cultivated races were “punctatum” and “latifolium” (Lacape et al. 2007). Most of the cultivated G. hirsutum varieties are derived from “latifolium” (Hutchinson 1951). G. hirsutum “marie-galante” is thought to be a hybrid of G. hirsutum and G. barbadense (Stephens 1967; Percival and Kohel 1990).

After the revolution in the textile industry, G. hirsutum and G. barbadense replaced the diploid cultivated species over most of the cultivated area, especially in Asia (Iqbal et al. 2001; Rahman et al. 2008). G. hirsutum is cultivated on ~90 % of the total cotton area (Zhang et al. 2005), whereas a limited area is under G. barbadense, known for producing high-quality lint fiber (Abdalla et al. 2001; Rahman et al. 2012). Like other cultivated crop species, all the cultivated tetraploids and diploids have a narrow genetic base within their respective gene pools (Liu et al. 2005; Rahman et al. 2008, 2011; Shaheen et al. 2010) because cultivars have been developed from a few genotypes (Rahman et al. 2002b, 2005a). This phenomenon has been demonstrated in multiple investigations (Liu et al. 2000; Gutierrez et al. 2002; Lacape et al. 2007; Rahman et al. 2002b). Limited use of wild cotton species for breeding new cultivars is a second major reason for the narrow genetic base, which is a major cause of stagnation in cotton productivity worldwide (Rahman et al. 2002a; Zhang et al. 2005).

The genus Gossypium is a rich genetic repository for many important traits such as fiber strength, high fiber yield and high tolerance/immunity to viral diseases. For example, A-genome species are immune to cotton leaf curl disease, which causes substantial yield losses in Pakistan (Rahman et al. 2005b, 2008, 2011, 2012). It is, therefore, vital to assess the genetic diversity and phylogenetic relationships among the available cotton species before utilizing them in molecular breeding programs. Numerous methods, including conventional and DNA-based marker systems, have been employed for estimating the extent of genetic diversity among the cotton species (Menzel 1954; Brubaker and Wendel 1994, 2000; Abdalla et al. 2001; Gutierrez et al. 2002; Guo et al. 2006; Lacape et al. 2007; Rahman et al. 2009; Kalivas et al. 2011). Earlier, RAPD markers were applied to work out the phylogeny of 31 cotton species (Khan et al. 2000), but RAPDs are handicapped by reproducibility concerns. A few years later, SSRs were applied to 22 diploid cotton species (Guo et al. 2006), a set that does not represent most of the important cotton species. EST-SSRs, i.e., SSRs derived from coding sequences (Saha et al. 2003), have more potential to identify changes in genes accumulated during domestication (Wang et al. 2007). In contrast to EST-SSRs, genomic SSRs (gSSRs) are highly polymorphic and tend to be widely dispersed throughout the genome, but are less transferable across species (Peakall et al. 1998; Kuleung et al. 2004). EST-SSRs, in turn, are less informative (Decroocq et al. 2003) than gSSRs because they lie in transcribed regions and are therefore more conserved (Cho et al. 2000; Thiel et al. 2003).

In the present study, two types of markers, gSSRs and EST-SSRs, were used (1) to assess the genetic diversity and phylogenetic relationships among both diploid (wild, cultivated) and tetraploid (wild, cultivated) species/races, and (2) to compare the usefulness of these markers for estimating genetic divergence and phylogenetic relationships. The information generated will not only help in validating the existing phylogenetic relationships within the genus Gossypium but can also be utilized for introgressing novel traits from wild accessions into the cultivated species, an effective strategy to counter the negative impact of climate change.

Materials and methods

Plant material

Thirty-six Gossypium species/races were explored in this study (Fig. 1): 24 diploid species representing seven genomes (3 A-, 10 D-, 3 E-, 1 C-, 3 G-, 3 B- and 1 F-genome), 5 tetraploid species (AD-genome) and 7 races of G. hirsutum (“morrillii”, “palmeri”, “marie-galante”, “yucatanense”, “punctatum”, “latifolium” and “lanceolatum”). Leaf samples were taken from the Central Cotton Research Institute (CCRI), Multan, Pakistan and the National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Pakistan. The genomic DNA of G. hirsutum “yucatanense”, G. capitis-viridis, G. longicalyx Hutchinson & Lee and G. australe F. Muell. was provided by Prof J. McD. Stewart (University of Arkansas, Fayetteville, Arkansas, USA).

Fig. 1

Location of countries on the world map from which cotton species were collected for the current study: 1. G. herbaceum (A1), Southern Africa & Arabian Peninsula 2. G. herbaceum var. africanum (A1), Africa 3. G. arboreum (A2), India, Pakistan 4. G. thurberi (D1), Mexico, Arizona 5. G. klotzschianum (D3-k), Galapagos Islands 6. G. harknessii (D2-2), Mexico 7. G. davidsonii (D3-d), Mexico 8. G. aridum (D4), Mexico 9. G. raimondii (D5), Peru 10. G. gossypioides (D6), Mexico 11. G. lobatum (D7), Mexico 12. G. trilobum (D8), Mexico 13. G. laxum (D9), Mexico 14. G. hirsutum cv NIBGE 115 (AD1), Central & northern America, Caribbean, Pacific 15. G. tomentosum (AD3), Hawaii 16. G. mustilinum (AD4), Brazil 17. G. barbadense (AD2), Hawaii 18. G. darwinii (AD5), Galapagos 19. G. hirsutum “marie-galante” (AD), Caribbean, Central America 20. G. hirsutum “latifolium” (AD), Southern Mexico/Guatemala 21. G. hirsutum “morrillii” (AD), Mexico 22. G. hirsutum “palmeri” (AD), Mexico 23. G. hirsutum “punctatum” (AD), Cameron 24. G. hirsutum “yucatanense” (AD), Guadeloupe 25. G. lanceolatum (AD), Mexican states of Oaxaca & Guerrero 26. G. anomalum (B1), Africa 27. G. barbosanum (B2), Africa 28. G. capitis-viridis (B3), Africa, Cape Verde Island 29. G. somalense (E2), Africa/Arabia 30. G. stocksii (E1), Arabian Peninsula 31. G. longicalyx (F), East Africa 32. G. incanum (E4), Arabian Peninsula 33. G. robinsonii (C2), Australia 34. G. australe (G2), Australia 35. G. nelsonii (G3), Australia 36. G. bickii (G1), Australia

Extraction of DNA and SSR analysis

For genomic DNA isolation, young leaves from five plants of each species were collected from the field and stored in liquid nitrogen. The samples were ground in liquid nitrogen and DNA was extracted following a CTAB method with minor modifications (Doyle and Doyle 1987). DNA concentration was measured using a DyNA Quant 200 fluorometer (Hoefer, USA). Quality was checked by running 30 ng of genomic DNA of each species on a 0.8 % agarose gel. We also validated the concentration of the genomic DNA.

A total of 100 SSR primer pairs, 50 each of the BNL (representing gSSRs) and MGHES (representing EST-SSRs derived from fiber tissues of G. hirsutum) series, were used for the analysis. Sequences of these primers were downloaded from http://www.mainlab.clemson.edu/cmd/primer, and the primers were synthesized by GeneLink, USA. Polymerase chain reaction (PCR) was carried out in a gradient thermal cycler (Eppendorf, Germany). The total reaction mixture of 20 μl contained 1 × Fermentas Taq buffer, 2.5 mM MgCl2, 0.25 mM of each dNTP, 0.15 μM primers, one unit of Taq DNA polymerase (Fermentas, USA) and 50 ng of genomic DNA as template. The thermal cycler was programmed for 1 cycle of 5 min at 94 °C, followed by 35 cycles of denaturation at 94 °C for 30 s, primer annealing (50–61 °C) for 30 s and extension at 72 °C for 30 s. A final extension at 72 °C was carried out for 10 min. The PCR products were fractionated in 4 % MetaPhor agarose gels (Cambrex Corporation, USA). The gels were stained with ethidium bromide, visualized under ultraviolet (UV) light and scored manually.
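The cycling program described above can be written down as a small data structure, which also makes it easy to estimate the programmed run time. The following is only a sketch: the 55 °C annealing value is a placeholder (the text gives a primer-dependent range of 50–61 °C), the names are illustrative, and ramping time between temperatures is ignored.

```python
# Thermal profile taken from the text. The annealing temperature is
# primer-specific (50-61 °C); 55 °C is only a placeholder assumption.
INITIAL_DENATURATION = (94, 5 * 60)            # (°C, seconds)
CYCLE_STEPS = [(94, 30), (55, 30), (72, 30)]   # denature, anneal, extend
N_CYCLES = 35
FINAL_EXTENSION = (72, 10 * 60)


def total_hold_seconds():
    """Sum of programmed hold times, ignoring ramping between temperatures."""
    per_cycle = sum(seconds for _, seconds in CYCLE_STEPS)
    return INITIAL_DENATURATION[1] + N_CYCLES * per_cycle + FINAL_EXTENSION[1]


print(total_hold_seconds() / 60)  # 67.5 (minutes of programmed holds)
```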

Statistical analysis

The gels were scored by assigning ‘1’ for the presence of an amplified allele and ‘0’ for the absence of a fragment. The reactions were repeated twice to confirm the absence of a fragment. The few loci that could not be amplified in some genotypes were scored as ‘null’ alleles. Only bright fragments were considered for scoring. We excluded 25 primers (16 gSSRs and 9 EST-SSRs) from the final analysis because of their poor amplification profiles. Polymorphism information content (PIC) was calculated to determine the diversity of each SSR locus (Anderson et al. 1993). The following formula was used to calculate PIC values.

$$ \text{PIC}_{i} = 1 - \sum_{j} P_{ij}^{2} $$

Here, Pij is the frequency of the jth allele at the ith locus, and the sum runs over all alleles at that locus.
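As an illustration of the PIC formula, the sketch below computes it for a single locus. It assumes one allele call per sample, so that allele frequencies at the locus sum to one, and it leaves missing calls out of the frequency estimate. Both are assumptions of this example rather than a description of the scoring used in the study, and the function name `pic` is illustrative.

```python
from collections import Counter


def pic(allele_calls):
    """PIC = 1 - sum of squared allele frequencies at one locus.

    allele_calls: one allele label per sample; None marks a missing call
    and is excluded from the frequency estimate (assumption of this sketch).
    """
    scored = [a for a in allele_calls if a is not None]
    if not scored:
        raise ValueError("no scored alleles at this locus")
    counts = Counter(scored)
    return 1.0 - sum((n / len(scored)) ** 2 for n in counts.values())


print(pic(["a", "a", "b", "b"]))   # two alleles at frequency 0.5 each -> 0.5
print(pic(["a", "a", "a", None]))  # monomorphic locus -> 0.0
```

A monomorphic locus gives PIC = 0, and the value approaches 1 as the number of equally frequent alleles grows.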

A similarity matrix was calculated from the scoring profile (Nei and Li 1979). These similarity coefficients were used to construct the phylogenetic tree by the unweighted pair group method with arithmetic mean (UPGMA) in PAUP version 4.4. Three dendrograms were developed, using the gSSR, EST-SSR and combined data sets. Amplification percentage was calculated by the formula described by Kuleung et al. (2004).

$$ \text{Amplification}\;(\%) = \frac{\text{no. of amplified markers} \times 100}{\text{total no. of markers}} $$

Similarly, transferability of each SSR marker was calculated as the percentage of cotton species in which it produced an amplified product. We also manually estimated the correlation between repeat type and polymorphism rate.
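Both percentages are the same ratio taken along different axes of a marker-by-species table. The sketch below shows this for a small made-up table; the marker and species names are hypothetical and the table layout is an assumption of this example.

```python
def percent(flags):
    """Share of True values in a non-empty sequence, as a percentage."""
    flags = list(flags)
    if not flags:
        raise ValueError("no observations")
    return 100.0 * sum(flags) / len(flags)


def amplification_by_species(table):
    """Percentage of markers that amplified in each species."""
    species = next(iter(table.values())).keys()
    return {sp: percent(row[sp] for row in table.values()) for sp in species}


def transferability_by_marker(table):
    """Percentage of species in which each marker amplified."""
    return {marker: percent(row.values()) for marker, row in table.items()}


# Hypothetical data: True means the marker gave a product in that species.
table = {
    "m1": {"sp_A": True, "sp_B": True, "sp_C": False, "sp_D": True},
    "m2": {"sp_A": True, "sp_B": False, "sp_C": False, "sp_D": True},
    "m3": {"sp_A": True, "sp_B": True, "sp_C": True, "sp_D": True},
}
print(transferability_by_marker(table))  # {'m1': 75.0, 'm2': 50.0, 'm3': 100.0}
print(amplification_by_species(table))
```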

The frequency distribution of the alleles among all the cotton species was calculated manually in MS Excel. The alleles were divided into 10 frequency categories, i.e., alleles with a frequency of 0.1 or less, 0.19 or less, 0.29 or less, 0.39 or less, 0.49 or less, 0.59 or less, 0.69 or less, 0.79 or less, 0.89 or less and 0.99 or less.
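The same binning can be done programmatically. The sketch below uses the ten upper bounds listed above. Placing frequencies above 0.99 in the last category is an assumption of this example, since the text does not say how a frequency of exactly 1 was handled.

```python
from bisect import bisect_left

# Upper bounds of the ten categories named in the text.
UPPER_BOUNDS = [0.10, 0.19, 0.29, 0.39, 0.49, 0.59, 0.69, 0.79, 0.89, 0.99]


def bin_index(freq):
    """Index of the first category whose upper bound is >= freq.

    Frequencies above 0.99 fall into the last category (assumption).
    """
    return min(bisect_left(UPPER_BOUNDS, freq), len(UPPER_BOUNDS) - 1)


def histogram(frequencies):
    """Count of alleles per frequency category."""
    counts = [0] * len(UPPER_BOUNDS)
    for f in frequencies:
        counts[bin_index(f)] += 1
    return counts


# Made-up frequencies for illustration only.
print(histogram([0.025, 0.10, 0.15, 0.5, 1.0]))
# [2, 1, 0, 0, 0, 1, 0, 0, 0, 1]
```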

The association between polymorphism rate and the number of tandem repeats was also calculated manually on the basis of the PIC values given in Table 1.
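The text does not state which statistic was used for this association. One common choice is the Pearson correlation coefficient between repeat number and PIC, sketched below as an assumption rather than as the method actually applied. The example values are invented for illustration and are not taken from Table 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical repeat numbers and PIC values, for illustration only.
repeat_numbers = [6, 8, 10, 12, 15]
pic_values = [0.20, 0.31, 0.45, 0.52, 0.60]
print(round(pearson_r(repeat_numbers, pic_values), 3))
```

A value near +1 would indicate that loci with more repeats tend to have higher PIC.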

Table 1 Markers with repeat motif type, number of alleles, frequencies and PIC value

Results

Microsatellite polymorphism

A total of 36 Gossypium species/races were investigated using 75 SSR markers, which amplified 87 loci. Of these markers, 10 (MGHES-19, MGHES-33, MGHES-38A, MGHES-41, MGHES-60, BNL-448, BNL-1350, BNL-1878, BNL-3408 and BNL-3646) amplified two loci each, while one primer pair, BNL-3955, amplified three loci. A total of 73 SSR primer pairs (97.33 %) were polymorphic, while only two, MGHES-13 and MGHES-17, were monomorphic. Of the 73 polymorphic primer pairs, 26 (34.66 %) were polymorphic because of null alleles in a few species, while 47 (62.66 %) were polymorphic because they amplified alleles of different sizes. The size of amplified fragments ranged from 70 bp for MGHES-12 to 700 bp for MGHES-07. A total of 135 alleles were amplified. The number of alleles detected at a single locus ranged from 1 to 5, with an average of 2.87 alleles per locus. The maximum number of alleles (5) was amplified by each of the genomic SSRs (gSSRs) BNL-1878, BNL-2691, BNL-3955 and BNL-3985. A higher number of alleles was amplified in tetraploids (two alleles per locus) than in diploids (1.36 alleles per locus).

The frequency distribution of the 135 alleles is shown in Fig. 2. Allele frequencies ranged from 0.025 to 1, with an average of 0.469 (Table 1). A total of 18 alleles (16.51 %) appeared with a frequency of 0.10 or lower, whereas 29 (26.6 %) appeared with a frequency of 0.99 or higher. No allelic variation was observed at the MGHES-13 and MGHES-17 loci, which were amplified in all cotton species.

Fig. 2

Frequency distribution of 135 alleles in 36 cotton species/landraces

In this study, the average polymorphism information content (PIC) was 0.50, with the highest value of 0.882 for MGHES-27 and the lowest value of 0.11 for BNL-3895 (Table 1). Genomic SSRs showed a higher PIC (average 0.35) than EST-SSRs (average 0.291). Diploids exhibited a higher average PIC (0.30) than tetraploids (0.21).

Genetic characterization

In the present study, 22 informative SSRs (11 BNLs and 11 MGHES; Table 2) with PIC ≥ 0.5 were identified that can distinguish all the Gossypium species. All useful gSSRs (BNLs) contained a di-nucleotide motif, whereas 63.63 % of the EST-SSRs (MGHES) contained tri-nucleotide motifs and 27.27 % had di-, tetra- or penta-nucleotide motifs. The position of SSRs on the chromosome (either proximal to the centromere or near the distal end) had no effect on polymorphism information content (Table 2), while polymorphism rate was positively correlated with the number of tandem repeats (Table 1).

Table 2 Most informative SSRs with their position on chromosome

Transferability of SSRs across Gossypium genomes and genome specificity

Out of the 75 SSRs, 22 (29.33 %) produced amplicons in all 36 species/races. We did not find any association between repeat motif type and rate of transferability. Out of the total amplified fragments (75 SSRs × 36 species = 2,700), 44.16 % were found in more than one genome group. Four primers, BNL-1350, BNL-3147, BNL-3065 and BNL-3558, amplified in the fewest species, whereas MGHES-15, MGHES-17, MGHES-21, MGHES-26, MGHES-28, MGHES-30, MGHES-52, BNL-448, BNL-3672, BNL-3793 and BNL-3985 amplified fragments in most species. Among diploids, the species belonging to the A-, B-, F- and E-genomes showed a high transferability rate, while D-genome species exhibited a low transferability rate (Table 3). We identified 15 genome- or species-specific primers (Table 4). None of the species belonging to the A-, B-, C- and G-genomes were amplified with BNL-3147; similarly, none of the species of the A- and AD-genomes were amplified with MGHES-44. Thus, BNL-3147 can be used as A-, B-, C- and G-genome negative and MGHES-44 can be used as A- and D-genome negative. BNL-3888 did not amplify any of the D- and E-genome species and can therefore be used as D- and E-genome negative. MGHES-16, BNL-1053 and BNL-1359 did not amplify in G. trilobum (DC) Skovst., and BNL-3482 did not amplify in G. aridum (Rose & Standl.) Skovst. Two EST-SSRs, MGHES-20 and MGHES-21, could not amplify the genomic DNA of G. darwinii, while MGHES-15 could not amplify in G. aridum and G. trilobum (Table 4). Thus, this set of primers can be utilized as species-specific primers.
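The idea of a "genome-negative" marker, one that fails to amplify in every species of a genome group, can be made concrete with a small lookup. The sketch below uses hypothetical marker and species names and an assumed table layout; it is not the procedure used to build Table 4.

```python
def genome_negative_markers(amplified, genome_of):
    """Map each marker to the genome groups in which it amplified in no species.

    amplified: {marker: {species: bool}}
    genome_of: {species: genome label}
    Markers that amplified in at least one species of every group are omitted.
    """
    result = {}
    for marker, by_species in amplified.items():
        groups = {}
        for species, ok in by_species.items():
            groups.setdefault(genome_of[species], []).append(ok)
        negative = sorted(g for g, flags in groups.items() if not any(flags))
        if negative:
            result[marker] = negative
    return result


# Hypothetical data for illustration only.
genome_of = {"sp1": "A", "sp2": "A", "sp3": "D", "sp4": "D", "sp5": "AD"}
amplified = {
    "m1": {"sp1": False, "sp2": False, "sp3": True, "sp4": True, "sp5": True},
    "m2": {"sp1": True, "sp2": False, "sp3": False, "sp4": False, "sp5": True},
    "m3": {"sp1": True, "sp2": True, "sp3": True, "sp4": True, "sp5": True},
}
print(genome_negative_markers(amplified, genome_of))
# {'m1': ['A'], 'm2': ['D']}
```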

Table 3 Transferability of G. hirsutum derivative SSRs in other Gossypium species/genomes
Table 4 Genome and species-specific amplification features of SSRs

Microsatellite performance among diploid (A and D) and tetraploid (AD) genome species

In the present investigation, 12 primer pairs (MGHES-12, MGHES-22, BNL-1878, BNL-2449, BNL-2634, BNL-2691, BNL-3147, BNL-3103, BNL-3408, BNL-3793, BNL-3955 and BNL-3985) did not amplify A-genome species, while these primers produced some private alleles in D- and AD-genome species, indicating specificity of these primers for the D-genome.

The sizes of many amplicons of tetraploids were different from those of diploids (A, D) (Fig. 3). The size of amplified fragments in all the A-, D- and AD-genome species was in the range of 101–700 bp. However, the percentage of fragments within 101–300 bp in AD-genome species was higher than that in A- and D-genome species. All the A-genome species produced relatively larger fragments, between 301 and 400 bp (Fig. 3).

Fig. 3

Distribution of fragment sizes amplified by SSRs

A- and D-genome species relationship with AD-genome species

Gossypium herbaceum showed genetic similarities of 0.661 and 0.624 with G. hirsutum and G. barbadense, respectively, while G. arboreum showed similarities of 0.57 and 0.63 with G. hirsutum and G. barbadense, respectively (Table 5). Among D-genome species, G. raimondii was genetically closest to G. hirsutum (0.642) and G. barbadense (0.667). Average coefficients of genetic similarity of diploid (A-, D-) cotton species (11 in number) with tetraploid (AD) species (five in number) were in the range of 0.56–0.64 (Table 5). G. raimondii showed the highest (0.652) while G. lobatum Gentry showed the lowest (0.566) mean genetic similarity with the five tetraploid species.

Table 5 Genetic similarity coefficient between tetraploid species and A-/D-genome species

Genetic similarity among diploid and tetraploid cotton species with EST and gSSRs

Gossypium arboreum (A2) and G. herbaceum (A1) showed the maximum genetic similarity (0.89), followed by G. barbosanum Phillips & Clement (B2) and G. capitis-viridis (B3), while the lowest genetic similarity coefficient (0.50) was observed between G. herbaceum var. africanum (A1) and G. robinsonii (C2). Among diploid species, the average genetic similarity coefficient was 0.67; G. thurberi Tod. (D1) showed the highest (0.71) and G. robinsonii (C2) the lowest (0.62) average similarity to all other diploid species.

Genetic similarity coefficients between tetraploid species/races were in the range of 0.62–0.85 (average 0.73). The least genetic similarity (0.62) was found between G. hirsutum “yucatanense” and G. mustilinum Miers ex G. Watt, while the maximum genetic similarity (0.85) was found between G. darwinii and G. barbadense as well as between G. tomentosum Nutt. ex Seem. and G. hirsutum. G. hirsutum “punctatum” showed the highest genetic dissimilarity with the other tetraploids. Among the races, G. hirsutum “morrillii” Cook & Hubb. and G. hirsutum “punctatum” Schumach. showed close relatedness with G. hirsutum “palmeri” (Watt) Wouters and G. hirsutum “yucatanense”, respectively.

Phylogenetic study of 36 cotton species with combined data of gSSRs and EST-SSRs

The average genetic similarity coefficient of the 36 Gossypium species/races was 0.64, with a range of 0.49–0.89 (Table 6). A dendrogram (Fig. 4) generated using these genetic similarity coefficients grouped the species into three major clusters, ‘A’, ‘B’ and ‘C’. The major cluster ‘A’ consisted of two subclusters (a1, a2). All A-genome species were grouped in the subcluster ‘a1’ and all tetraploid species/races were grouped in the ‘a2’ subcluster. Among allotetraploids, G. barbadense showed close relatedness with G. darwinii. The major cluster ‘B’ comprised 10 species in two subclusters, ‘b1’ and ‘b2’. All B- and E-genome species (6 in number) were grouped in subcluster ‘b1’ while the subcluster ‘b2’ consisted of four species (Fig. 4).
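For readers who want to see how a dendrogram of this kind arises from 0/1 band scores, the following is a minimal pure-Python sketch of the Nei and Li (1979) similarity coefficient followed by UPGMA clustering on distance = 1 − similarity. It is an illustration under simplifying assumptions (no missing data, topology only, no branch lengths) and is not a substitute for the PAUP analysis used in the study; the sample names and band profiles are hypothetical.

```python
from itertools import combinations


def nei_li_similarity(a, b):
    """Nei-Li (Dice) coefficient: 2 * shared bands / (bands in a + bands in b)."""
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    return 2.0 * shared / total if total else 0.0


def upgma(profiles):
    """Cluster samples by UPGMA; returns the tree topology as nested tuples."""
    names = list(profiles)
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    dist = {
        (i, j): 1.0 - nei_li_similarity(profiles[names[i]], profiles[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        # Size-weighted average distance from the merged cluster to the others.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        # Drop all distances that involve the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


# Hypothetical band profiles (1 = band present, 0 = absent).
profiles = {
    "sp1": [1, 1, 0, 1, 0],
    "sp2": [1, 1, 0, 1, 1],
    "sp3": [0, 0, 1, 0, 1],
}
print(upgma(profiles))  # ('sp3', ('sp1', 'sp2'))
```

In this toy example sp1 and sp2 share three bands and are joined first; sp3 joins last.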

Table 6 Genetic similarity coefficients of 36 cotton species using two types of markers (EST-SSRs and gSSRs)
Fig. 4

Phylogenetic analysis of 36 cotton species with combined data set of ESTs and gSSRs: The code represents the species as G. herbaceum A1 (1), G. herbaceum var. africanum A1 (2), G. arboreum A2 (3), G. thurberi D1 (4), G. klotzschianum D2 (5), G. harknessii D2-2 (6), G. davidsonii D3 (7), G. aridum D4 (8), G. raimondii D5 (9), G. gossypioides D6 (10), G. lobatum D7 (11), G. trilobum D8 (12), G. laxum D9 (13), G. tomentosum AD3 (14), G. hirsutum AD1 (15), G. mustilinum AD4 (16), G. barbadense AD2 (17), G. darwinii AD5 (18), G. hirsutum “marie-galante” AD (19), G. hirsutum “latifolium” AD (20), G. hirsutum “morrillii” AD (21), G. hirsutum “palmeri” AD (22), G. hirsutum “punctatum” AD (23), G. hirsutum “yucatanense” AD (24), G. hirsutum “lanceolatum” AD (25), G. anomalum B1 (26), G. barbosanum B3 (27), G. capitis-viridis B4 (28), G. somalense E2 (29), G. stocksii E1 (30), G. longicalyx F1 (31), G. incanum E4 (32), G. robinsonii C1 (33), G. australe G2 (34), G. nelsonii G3 (35), G. bickii G1 (36)

A total of nine D-genome species constituted the major cluster ‘C’, containing two subclusters (C1, C2). In the subcluster ‘C1’, G. klotzschianum Andersson D2 with G. davidsonii Kellogg D3, and G. harknessii Brandegee D2-2 with G. aridum D4, formed sister clusters. G. thurberi D1 was related to G. klotzschianum D2 and G. davidsonii D3 with a genetic relatedness of 80 %. Similarly, in subcluster ‘C2’, G. gossypioides D6 and G. lobatum D7 formed a sister-group relationship. The most divergent species of the dendrogram was G. longicalyx F1, which was 62.15 % genetically related to all other species, followed by G. laxum Phillips D9, which was 64.94 % genetically related to the other species.

Clustering of species with EST-SSRs

Cluster analysis based on EST-SSRs grouped the species into seven major clusters, A through G (Fig. 5). The clustering of species with EST-SSRs was closer to the phylogenetic tree obtained from the combined data set, except for a few differences such as the grouping of G. raimondii with A-genome species and sister clustering between A- and D-genome species. Sister clustering between B- and E-genome species and also between C- and G-genome species was observed with EST-SSRs as well as with the combined data set.

Fig. 5

Clustering of species with EST-SSRs: The code represents the species as G. herbaceum A1 (1), G. herbaceum var. africanum A1 (2), G. arboreum A2 (3), G. thurberi D1 (4), G. klotzschianum D2 (5), G. harknessii D2-2 (6), G. davidsonii D3 (7), G. aridum D4 (8), G. raimondii D5 (9), G. gossypioides D6 (10), G. lobatum D7 (11), G. trilobum D8 (12), G. laxum D9 (13), G. tomentosum AD3 (14), G. hirsutum AD1 (15), G. mustilinum AD4 (16), G. barbadense AD2 (17), G. darwinii AD5 (18), G. hirsutum “marie-galante” AD (19), G. hirsutum “latifolium” AD (20), G. hirsutum “morrillii” AD (21), G. hirsutum “palmeri” AD (22), G. hirsutum “punctatum” AD (23), G. hirsutum “yucatanense” AD (24), G. hirsutum “lanceolatum” AD (25), G. anomalum B1 (26), G. barbosanum B3 (27), G. capitis-viridis B4 (28), G. somalense E2 (29), G. stocksii E1 (30), G. longicalyx F1 (31), G. incanum E4 (32), G. robinsonii C1 (33), G. australe G2 (34), G. nelsonii G3 (35), G. bickii G1 (36)

Clustering of species with gSSRs

Cluster analysis based on gSSRs grouped the species into six major clusters, A through F (Fig. 6). Clustering of species with gSSRs deviated from both the EST-SSR and combined data sets. In the gSSR-based dendrogram, the E-genome species G. somalense Gurke (E2) was grouped with D-genome species. Similarly, G. incanum (Schwartz) Hillcoat (E4) and G. stocksii Mast. (E1) were grouped with G. robinsonii (C1) and G. longicalyx (F1), respectively. Also, two races of G. hirsutum (“lanceolatum” and “latifolium”) were grouped with D-genome species.

Fig. 6

Clustering of species with gSSRs: The code represents the species as G. herbaceum A1 (1), G. herbaceum var. africanum A1 (2), G. arboreum A2 (3), G. thurberi D1 (4), G. klotzschianum D2 (5), G. harknessii D2-2 (6), G. davidsonii D3 (7), G. aridum D4 (8), G. raimondii D5 (9), G. gossypioides D6 (10), G. lobatum D7 (11), G. trilobum D8 (12), G. laxum D9 (13), G. tomentosum AD3 (14), G. hirsutum AD1 (15), G. mustilinum AD4 (16), G. barbadense AD2 (17), G. darwinii AD5 (18), G. hirsutum “marie-galante” AD (19), G. hirsutum “latifolium” AD (20), G. hirsutum “morrillii” AD (21), G. hirsutum “palmeri” AD (22), G. hirsutum “punctatum” AD (23), G. hirsutum “yucatanense” AD (24), G. hirsutum “lanceolatum” AD (25), G. anomalum B1 (26), G. barbosanum B3 (27), G. capitis-viridis B4 (28), G. somalense E2 (29), G. stocksii E1 (30), G. longicalyx F1 (31), G. incanum E4 (32), G. robinsonii C1 (33), G. australe G2 (34), G. nelsonii G3 (35), G. bickii G1 (36)

Discussion

Microsatellite polymorphism

Faint amplification or failure of SSR fragment amplification was expected because the SSR primer pairs were designed from sequences derived from G. hirsutum. It is likely that, during evolution, mutations accumulated in the annealing sites and/or these loci were lost in the diploid species, which together may influence the annealing of these primers (Liu et al. 2000). In the present study, ~97 % of the SSRs were polymorphic. Similar results have been reported in studies of genetic divergence among 31 Gossypium species using RAPDs (Khan et al. 2000) and 25 diploid Gossypium species using SSRs (Guo et al. 2006). Such a high allelic polymorphism rate among species is the result of accumulation of mutations during evolution (Nei 2007). The average number of alleles per locus (2.87) was slightly higher than in previous reports (2 alleles; Wu et al. 2007b). Similarly, more alleles were amplified in tetraploids than in diploids, which is in agreement with earlier reports (Gutierrez et al. 2002; Kalivas et al. 2011). The multi-fold (30–36) increase in ploidy level of tetraploids (Paterson et al. 2012) is one possible explanation for the amplification of more alleles. The number of alleles is positively correlated with the repeat number (Lacape et al. 2007), the ploidy level of the germplasm (Udall and Wendel 2006), the number of genotypes surveyed (Lacape et al. 2007; Guo et al. 2007) and the accuracy of the system used for resolving amplicons.

Polymorphism information content (PIC) is an important parameter that helps in choosing SSRs for evaluating germplasm, gene tagging, etc. (Peng and Lapitan 2005). In the present study, the higher PIC value of gSSRs versus EST-SSRs suggests that transcribed portions of the genome are conserved (La Rota et al. 2005; Eujayl et al. 2002). Inconsistent PIC values have been reported in multiple studies (Kebede et al. 2007; Liu et al. 2000; Kalivas et al. 2011), which is attributed to the kind of germplasm explored, bottlenecks in domestication (Thuillet et al. 2004; Vigouroux et al. 2005) and the kind of DNA markers used (Liu et al. 2000; Gutierrez et al. 2002). Also, PIC values of the SSRs surveyed on diploid species were higher than those on tetraploids. Most diploid species were wild, except the A-genome species. Because wild species were not domesticated, they were not subjected to selection pressure favoring particular alleles, which may explain their higher PIC values (Vigouroux et al. 2002; Qureshi et al. 2004).

Genetic characterization

The PIC values can guide the selection of the most informative SSRs for calculating genetic divergence (Candida et al. 2006); thus, the number of SSRs can be reduced substantially (Candida et al. 2006) before initiating genetic diversity and variety identification experiments (Macaulay et al. 2001; Masi et al. 2003; Jain et al. 2004). In this study, we proposed 22 (11 BNLs and 11 MGHES) of the 75 SSRs, based on their high PIC values (PIC ≥ 0.5) and potential to amplify distinct DNA fragments, for calculating the extent of genetic diversity among the 36 Gossypium species. Similar marker sets have been reported in other investigations, including 39 SSRs for cotton genetics studies (Lacape et al. 2007) and 25 SSRs for G. arboreum accessions (Kantartzi et al. 2009). A large number of informative gSSRs had di-nucleotide repeats, while a larger portion of informative EST-SSRs had tri-nucleotide motifs. The dominance of trimeric SSRs is possibly due to suppression of non-trimeric SSRs in coding regions, which avoids frameshift mutations (La Rota et al. 2005). Another possible reason for the high proportion of tri-nucleotide repeats in coding regions is selection pressure for particular single amino acid stretches (Morgante et al. 2002). Also, the most informative SSRs contained ≥10 repeats, which is in agreement with previous studies (Vigouroux et al. 2002; Qureshi et al. 2004; Kantartzi et al. 2009). It was also observed that the position of the SSR loci on the chromosome has no effect on the corresponding PIC values (Lacape et al. 2003, 2007).

In the present study, no correlation was observed between the rate of polymorphism and repeat motif type, which contradicts the findings of Lacape et al. (2007). They found repeat-motif-dependent polymorphism in cotton and showed that SSRs with the GA repeat motif exhibited higher PIC values and more alleles than SSRs with the CA repeat motif, while Thuillet et al. (2004) found that SSRs with the CA repeat motif exhibited significantly fewer alleles than GA SSRs in wheat. This might be due to differences in nucleotide distribution among genomes, but further investigations with a larger number of markers are required to confirm whether polymorphism depends on repeat motif type.

Performance of microsatellite between A, D and AD-genome species

The SSRs that did not amplify distinct fragments from genomic DNA of A-genome species but produced clear bands in the D- and AD-genome species have been placed on the D-subgenome of allotetraploid cotton (Lacape et al. 2003; Mei et al. 2004). Only 12 markers were D-genome specific, reflecting substantial divergence of D-genome species from the D-subgenome of allotetraploid cotton (Brubaker et al. 1999; Adams and Wendel 2004; Guo et al. 2006; Wu et al. 2007). Amplicon sizes (101–300 bp) of a number of SSRs differed between AD-genome species and their diploid ancestral species (A and D). Similar observations have been reported (Syed et al. 2001) and are attributed to the type/number of repeat motifs and flanking sequences (Buteler et al. 1999).

The size distribution of amplified fragments in species containing AD- and D-genomes was dispersed, while alleles amplified in A-genome species showed a concentrated distribution (Fig. 3). Our results contradict those of Liu et al. (2006), who found a dispersed distribution of fragment sizes in G. arboreum and a relatively concentrated distribution in G. hirsutum.

Cross species amplification and genome specificity

Genome/species-specific SSRs can be useful in monitoring introgression of specific genomic portions of the donor species into the adapted species (Guo et al. 2007), and can be instrumental in assigning unknown plants to species and in distinguishing cotton species. In this study, 15 genome/species-specific SSRs were observed.

The transferability of SSRs derived from tetraploids to diploids indicates evolution of all genomes from one ancestor. We found a higher transferability rate in A-genome than in D-genome species, indicating that the D-subgenome in tetraploids diverged from its progenitor D-genome during polyploidization (Liu et al. 2006). In addition, the higher transferability rate in A-genome species may be due to the larger size of the A-genome (Edwards et al. 1974; Reinisch et al. 1994).

In this study, gSSRs (BNLs) showed low transferability (37.28 %) and a high polymorphism rate across the species, versus the high transferability (54.72 %) and low polymorphism rate exhibited by EST-SSRs, primarily because of the conserved nature of the latter (Cho et al. 2000; Thiel et al. 2003). The EST-SSRs derived from fiber tissues showed a high level of transferability in diploid genomes, confirming the presence of fiber-related genes in all the cotton genomes. The phenomenon of transferability has also been reported in other crop species (Kuleung et al. 2004; Saha et al. 2004).

Genetic relationship of tetraploid species with their wild relatives

Among A-genome species, G. herbaceum (A1) was relatively closer to G. hirsutum (AD1), while G. arboreum (A2) was closer to G. barbadense. Among D-genome species, G. raimondii (D5) was more similar to G. hirsutum (AD1) (0.667). In another study, G. herbaceum was found genetically closer to G. hirsutum (0.69) than G. arboreum was (0.52). It has also been observed that G. raimondii (D5) and G. gossypioides (D6) are genetically more similar to G. hirsutum (AD1) and G. barbadense (AD2) (Kebede et al. 2007). In a few cytogenetic studies, it was shown that G. herbaceum is closer to the ancestor species of tetraploid cotton than G. arboreum (Endrizzi et al. 1985; Wendel 1989; Percival and Kohel 1990).

Genetic diversity and phylogenetic relationship in the genus Gossypium

For evolutionary studies of cotton species, the basic requirements are to work out their phylogeny (Khan et al. 2000; Abdalla et al. 2001; Paterson et al. 2002) and to estimate the extent of genetic diversity (Khan et al. 2000). Various methods based on morphology, meiotic behavior, and genetic and molecular techniques have been deployed for genetic diversity and phylogeny studies of cotton species. In this study, two types of SSR markers (EST-SSRs and genomic SSRs) were used to study the phylogenetic relationships among cotton species. The results show that diploid species are genetically more diverse from each other than the tetraploid germplasm is. A low to moderate level of genetic similarity among Gossypium species was estimated. This is consistent with the findings of Abdalla et al. (2001), who calculated relatively high estimates of genetic diversity among Gossypium species using the AFLP marker system.
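Pairwise genetic similarity estimates of the kind discussed here are typically computed from a binary matrix in which each SSR allele is scored as present (1) or absent (0) in each species. The sketch below uses the Jaccard coefficient as one common choice; the coefficient actually used is not stated in this section, so the choice of coefficient, the function names and the toy profiles are assumptions for illustration.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard similarity between two 0/1 allele profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same set of alleles")
    shared = sum(1 for x, y in zip(profile_a, profile_b) if x and y)
    either = sum(1 for x, y in zip(profile_a, profile_b) if x or y)
    return shared / either if either else 0.0


def similarity_matrix(profiles):
    """Return {(name_1, name_2): similarity} for every ordered pair."""
    names = list(profiles)
    return {
        (m, n): jaccard_similarity(profiles[m], profiles[n])
        for m in names
        for n in names
    }


# Hypothetical allele presence/absence scores (not data from the study).
toy_profiles = {
    "sp_A": [1, 1, 0, 1, 0],
    "sp_B": [1, 0, 0, 1, 1],
}
toy_matrix = similarity_matrix(toy_profiles)
# sp_A and sp_B share 2 alleles out of 4 present in either: similarity 0.5.
```

Shared absences are ignored by the Jaccard coefficient, which is often preferred for markers where a missing band can have several causes.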

Among the tetraploid germplasm, the lowest genetic similarity of G. hirsutum “punctatum” (an ancient cultivated race: Brubaker and Wendel 1994; Lacape et al. 2007) with the other tetraploid germplasm reveals the existence of unique/useful alleles in this race. Such races could be a promising source for broadening genetic diversity within cultivated cotton. G. hirsutum “latifolium”, which is genetically closer to G. hirsutum (Lacape et al. 2007), would pose the fewest obstacles (Lubbers and Chee 2009) in attempting crosses. Within tetraploids, the high genetic similarity estimates between G. barbadense and G. darwinii, and between G. tomentosum and G. hirsutum, are consistent with earlier reports (Liu et al. 2000; Wendel and Cronn 2003; Lacape et al. 2007). In contrast, variation in restriction sites of cpDNA and rDNA and in allozymes (14 enzyme systems) indicated that G. tomentosum is more distinct from G. hirsutum (0.82) than from G. barbadense (0.65) (Dejoode and Wendel 1992), which contradicts our findings.

In cluster analysis, A-genome species formed a sister cluster with D-genome species when the EST-SSR and gSSR data were analyzed separately, whereas with the combined data they formed a sister cluster with AD-genome species within major cluster ‘A’. Similar groupings have been found using RAPD markers (Khan et al. 2000), cpDNA, ITS and combined data set based phylogenies (Seelanan et al. 1997) and a cpDNA restriction site based phylogeny (Wendel et al. 1992). The sister clustering of A-genome species with the AD-genome species strengthens the concept that the A-genome is the cytoplasmic donor of the AD-genome (Wendel 1989). A-genome species likely have larger chromosomes and more repetitive sequences than D-genome species (Geever 1980), which increases the opportunity to amplify similar (homologous) sequences among the genomes. Also, the D-genome evolves faster than the A-genome (Adams and Wendel 2004). All B-genome species showed a close relationship with each other using the EST-SSR, gSSR and combined data sets (Figs. 4, 5, 6), which is consistent with previous reports (Wendel and Albert 1992), and the ‘E’ genome species were grouped with the ‘B’ genome species. The ‘G’ genome species also grouped closely with each other in all three data sets. They were grouped with the ‘C’ genome species using EST-SSRs and the combined data set. With gSSRs, the ‘C’ genome species were grouped with the ‘E’ genome species, whereas the ‘G’ genome species formed a separate group that appeared to be distantly related to all other genomic groups, indicating that all ‘G’ genome species share a common ancestor (Fryxell et al. 1992; Liu et al. 2001).
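Cluster analyses of this kind are often built by converting similarity to distance (1 − similarity) and applying average-linkage clustering (UPGMA). The clustering method is not named in this section, so the following pure-Python sketch is only one plausible illustration; the function name, species labels and distances are hypothetical.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    `names` is a list of labels and `dist` a list of lists of distances.
    Returns the tree as nested tuples of labels.
    """
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    d = [row[:] for row in dist]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of current clusters (first pair wins ties).
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda ab: d[ab[0]][ab[1]],
        )
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted mean distance from the merged cluster to the others.
        new_row = [
            (size_i * d[i][k] + size_j * d[j][k]) / (size_i + size_j)
            for k in keep
        ]
        d = [[d[r][c] for c in keep] + [new_row[idx]]
             for idx, r in enumerate(keep)]
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0]


# Hypothetical distances (1 - similarity) among four taxa.
toy_names = ["w", "x", "y", "z"]
toy_dist = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.7, 0.8, 0.3, 0.0],
]
toy_tree = upgma(toy_names, toy_dist)
```

In this toy example the two closest pairs (w, x) and (y, z) are joined first and then merged into a single tree, which is the same logic by which sister clusters arise in a dendrogram.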

The eight Gossypium genomes comprise four major lineages spread across three continents. The C-, G-, and K-genome species are found in Australia and the D-genome species in America, while Africa/Arabia harbors two lineages: the first comprising the A-, B-, and F-genome species and the second comprising the E-genome species (Fryxell 1979; Fryxell et al. 1992). The clustering of the (B and E) and (C and G) genomes in one cluster with the EST-SSR and combined data is probably due to their evolution from a common ancestor. A large body of genomic data is consistent with the aforementioned taxonomy of the cotton species (Wu et al. 2007).

With all three data sets, the D-genome species were grouped in one cluster, except G. raimondii (D5), which grouped with the A-genome species using EST-SSR data. This grouping of G. raimondii (D5) with the A-genome species was not surprising, as several studies have found it to be isolated from the rest of the D-genome species (Fryxell 1979; Parks et al. 1975; Phillips 1966). It is also geographically disjunct from the rest of the subgenus. Within the D-genome cluster, the pairs (G. aridum D4 and G. harknessii D2-2), (G. gossypioides D6 and G. lobatum D7) and (G. klotzschianum D2 and G. davidsonii D3) grouped closely with all three data sets. The position of the remaining three D-genome species (G. thurberi D1, G. trilobum D8 and G. laxum D9) remained unresolved, as these species were grouped in different clusters with the independent and combined EST-SSR and gSSR data, although a few studies have reported a close genetic relationship between G. thurberi D1 and G. trilobum D8 (Wu et al. 2007; Guo et al. 2007). The close relationship of G. klotzschianum D2 and G. davidsonii D3 is congruent with earlier reports (Wu et al. 2007; Wendel and Albert 1992; Guo et al. 2008).

The position of G. longicalyx, the only F-genome species, also remained unresolved in the present study. With the gSSR data, this species showed a close association with the E-genome species (G. stocksii E1), while with the EST-SSR data set it grouped with the uncommon tetraploid species ‘G. lanceolatum (AD)’. In earlier reports, high genetic similarity of G. longicalyx with the A-genome or its allotetraploid derivatives was reported (Wendel 1989). However, the majority of earlier reports placed G. longicalyx in a relatively isolated position (Phillips and Strickland 1966; Saunders 1961; Wu et al. 2007).

The phylogenetic tree of species obtained with the EST-SSR data closely resembled the tree obtained from the combined data set. On the basis of these findings, it is suggested that a relatively limited number of EST-SSRs, rather than a large number of markers, can be instrumental in resolving phylogenies at the species level. Moreover, this study confirms the usefulness of a limited number of EST-SSRs for fingerprinting geographically isolated species (Gutierrez et al. 2002; Rahman et al. 2002b; Shaheen 2005) and for determining the phylogeny of the species.

Conclusion

Genomic SSR markers are more informative for fingerprinting and for estimating genetic diversity among the cotton species because more alleles occur in microsatellite regions. EST-SSRs are more powerful for detecting changes that occurred as a result of selection during domestication. Only a few EST-SSR markers, rather than a large number of SSR markers, are sufficient for resolving the phylogenetic relationships of cotton species. The number of repeats per locus was positively correlated with the number of alleles amplified, the allele size range and the polymorphism information content. Repeat motif type and the position of loci on the chromosome had no effect on the polymorphism rate.
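For reference, polymorphism information content (PIC) can be computed from the allele frequencies at a locus. The sketch below uses the commonly applied multi-allele formula, PIC = 1 − Σp_i² − ΣΣ 2p_i²p_j² (the double sum running over all pairs i < j); whether this exact formula was used in the study is an assumption, and the function name is hypothetical.

```python
def polymorphism_information_content(freqs):
    """PIC for one locus from its allele frequencies (which should sum to 1).

    Uses PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2.
    """
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1.0 - homozygosity - cross_term


# Two equally frequent alleles give the maximum PIC for a biallelic locus.
pic_two_alleles = polymorphism_information_content([0.5, 0.5])  # 0.375
```

Under this formula a locus with more alleles at balanced frequencies has a higher PIC, which agrees with the positive correlation between allele number and PIC noted above.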

Tetraploid species amplified more alleles than the diploid species. The presence of 18 alleles (16.51 %) with a frequency of 0.10 or lower is one indicator of mutation or introgression of new alleles into the germplasm pool. G. mustelinum, G. tomentosum, G. darwinii, G. hirsutum “yucatanense” and G. herbaceum var. africanum were found to be genetically dissimilar and can be utilized in breeding programs to broaden the genetic base of the cultivated cotton species, a route to sustainable cotton production under a changing climate.