Genome-wide SSR-based association mapping for fiber quality in nation-wide upland cotton inbreed cultivars in China

Nie, Xinhui; Huang, Cong; You, Chunyuan; Li, Wu; Zhao, Wenxia; Shen, Chao; Zhang, Beibei; Wang, Hantao; Yan, Zhenhua; Dai, Baoshen; Wang, Maojun; Zhang, Xianlong; Lin, Zhongxu

doi:10.1186/s12864-016-2662-x

Research article
Open access
Published: 13 May 2016

Volume 17, article number 352, (2016)
BMC Genomics

Xinhui Nie¹,²,
Cong Huang¹,
Chunyuan You²,
Wu Li³,
Wenxia Zhao¹,
Chao Shen¹,
Beibei Zhang¹,
Hantao Wang¹,
Zhenhua Yan¹,
Baoshen Dai¹,
Maojun Wang¹,
Xianlong Zhang¹ &
Zhongxu Lin¹

Abstract

Background

Since upland cotton was introduced into China during the 1920s–1950s, hundreds of inbred cultivars have been developed. To explore the molecular diversity, population structure and elite alleles, 503 inbred cultivars developed in China and some foreign cultivars from the United States and the Soviet Union were collected and analyzed by 494 genome-wide SSRs (Simple Sequence Repeats).

Methods

Four hundred and ninety-four pairs of SSRs with high polymorphism and uniform distribution on 26 chromosomes were used to scan polymorphisms in 503 nation-wide upland cottons. The programming language R was used to make boxplots for the phenotypic traits in different environments. Molecular marker data and 6 fiber quality traits were analyzed by the method of MLM (mixed linear model) (P + G + Q + K) in the TASSEL software package on the basis of the population structure and linkage disequilibrium analysis. The loci of elite allelic variation and typical materials carrying elite alleles were identified based on phenotypic effect values.

Results

A total of 179 markers were polymorphic and generated 426 allele loci; based on molecular diversity, the population was classified into seven subpopulations corresponding to pedigree origin and to ecological and geographical distribution. Linkage disequilibrium decayed significantly within 0–5 cM. Association mapping for fiber quality showed that 216 marker loci were associated with fiber quality traits (P < 0.05), explaining 0.58 % to 5.12 % of the phenotypic variation, with an average of 2.70 %. Thirteen marker loci were coincident with other studies, and three were detected for the same trait. Seven quantitative trait loci were related to known genes in fiber development. Based on phenotypic effects, 48 typical materials that contained the elite allele loci related to fiber quality traits were identified; these materials are widely used in practical breeding.

Conclusions

The molecular diversity and population structure of 503 nation-wide upland cottons in China were evaluated by 494 genome-wide SSRs, and association mapping for fiber quality revealed known and novel elite alleles. The molecular diversity provides a guide for parental mating in cotton breeding, and the association mapping results will aid in fine-mapping genes related to fiber quality traits and facilitate further studies on candidate genes.

Background

Cotton is one of the world’s most important cash crops, and cotton fiber provides an important raw material for the textile industry. There are four cultivated cotton species, including diploids of Gossypium herbaceum and G. arboreum and tetraploids of G. hirsutum and G. barbadense. G. hirsutum cottons (upland cottons) are planted widely due to their wide adaptability and high yield. These cottons account for more than 95 % of the world’s cotton production [1].

China is one of the largest cotton-producing countries in the world but is not the country of origin for cotton. Cotton production and breeding in China were developed on the basis of introduced varieties [2]. Upland cotton is native to Central America. Trice, Lone star, Stoneville 2B, DPL15, Uganda, KK1543 and 611Bo have been introduced into China from the United States and the former Soviet Union since 1918 [3]. Cotton breeding in China has passed through seven breeding generations of variety replacement (1904–1920–1936–1948, 1949–1958, 1959–1964, 1964–1979, 1980–1984, 1980s–1990s, and the middle and later 1990s to the present), accompanied by an expanding planting area. Thousands of upland cotton varieties and lines have been domesticated, bred and derived, of which more than 200 good varieties are widely used in production. Thus, the five main cotton regions, namely the Yellow River Region (YRR), the Yangtze River Region (YtRR), the Northwestern Inland Region (NIR), the Northern Specific Early Maturation Region (NSEMR) and the Southern China Region (SCR), formed gradually [4]. Due to the narrow genetic basis and long-term directional selection in breeding, the genetic diversity in these upland varieties is low [5–9]. Therefore, studying the genetic diversity of basic upland germplasms and derived varieties can reveal the genetic basis of cotton in China, clarify the genetic background and genetic diversity of existing germplasms, lay the foundation for breeders to explore and use genes for important traits effectively, and indicate the direction of germplasm innovation.

The majority of crop traits, such as agronomic, yield, quality and resistance traits, are quantitative traits controlled by multiple genes and show continuous phenotypic variation in segregating populations. Quantitative trait loci (QTL) that contribute little to the phenotype and are sensitive to the environment are difficult to identify [10]. Recently, advances in molecular markers and in statistical methods for quantitative traits have provided a platform for studying the genetics of quantitative traits in crops. With the increasing number of molecular markers and the release of cotton genome sequences, cotton genetic maps have become increasingly saturated [11, 12], and QTL have been identified for agronomic traits [13], fiber quality [14, 15], growth stages [16] and resistance traits [17–21] by linkage mapping. However, linkage mapping has its own limitations: the segregating populations are derived from two specific parents, so linkage mapping considers only two alleles at a locus; the limited number of recombination events leads to QTL with low resolution, with the precision of linkage analysis commonly at 10–30 cM; and QTL detected in specific genetic backgrounds and environments cannot be applied directly to other cross combinations and environments without further verification.

In recent years, exploring quantitative trait genes by association analysis has been one of the most active research topics in plant quantitative genetics. Association analysis, also known as linkage disequilibrium mapping or association mapping, is based on linkage disequilibrium and combines the diversity of target traits with gene (locus) polymorphism to identify marker loci that are closely related to phenotypic variation and reflect the functions of specific genes. Association analysis offers the following advantages over traditional linkage analysis: it uses natural populations as experimental materials, detects multiple alleles at the same locus and can target single genes. However, linkage and association analysis are clearly complementary with respect to the accuracy and breadth of QTL mapping, the amount of information and the statistical methods. Linkage analysis preliminarily locates the allele controlling a target trait; association analysis performs fast fine-mapping of the target gene [22]. Thus, it is necessary to combine the advantages of both methods, with linkage analysis used to confirm the QTL.

In cotton, research has been conducted on traits related to agronomy, fiber quality, yield, growth stage and resistance using association analysis, and multiple marker loci associated with these traits, as well as elite alleles and carrier materials for breeding, have been identified [23–25]. However, the materials in these studies were limited and originated from a few cotton regions, so they were not sufficiently representative. In addition, the markers used in association analysis were not uniformly distributed on each chromosome, so they could not cover the whole cotton genome.

In this study, 503 upland cotton inbred cultivars, including those that have been grown in China since 1918 and inbred cultivars developed between 1920 and 2011, were used as the population panel; 494 genome-wide SSR markers from our high-density interspecific genetic map with 5152 markers [12] were selected at an average interval of 10 cM to genotype the population. The objectives of our study were: (1) to analyze the population structure of upland cotton inbred cultivars developed in China; (2) to detect the marker loci associated with fiber quality traits; (3) to explore the elite alleles and the typical carrier materials for future molecular design breeding in cotton; and (4) to provide multiple candidate genes and lay a foundation for further fine-mapping and gene cloning.

Results

Molecular genetic diversity

Among the 494 genome-wide SSR markers, 179 primer pairs displayed polymorphism, accounting for 36.16 % of the total primers, with an average of 6.885 markers per chromosome. A total of 426 allele loci were detected, with an average of 2.379 alleles per marker (ranging from 1 to 8). The average number of genotypes per marker was 4.413 (ranging from 2 to 34). The average genetic diversity was 0.377 (ranging from 0.012 to 0.893). The average polymorphism information content (PIC) was 0.336 (ranging from 0.012 to 0.887) (Additional file 1).
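The two per-marker summary statistics reported above can be illustrated with a short sketch. It assumes the commonly used definitions (gene diversity as expected heterozygosity, and PIC with the pairwise correction term); the text does not state which formulas or software settings were used, and the allele frequencies in the example are hypothetical.

```python
from itertools import combinations


def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum(p_i^2) over the alleles of one marker."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content:
    1 - sum(p_i^2) - sum over pairs i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    pair_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - pair_term


# Hypothetical marker with two equally frequent alleles (not data from the study).
example = [0.5, 0.5]
print(gene_diversity(example), pic(example))
```

Under these definitions PIC never exceeds gene diversity for the same marker, which agrees with the averages reported above (0.336 versus 0.377).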

The average genetic similarity coefficient among the 503 cultivars was 0.552 (ranging from 0.337 to 0.921) (Additional file 2). A two-dimensional diagram of the principal coordinate (PCA) analysis was produced based on the genetic distance (GD) matrix. Axes 1 to 3 explained 31.36 %, 22.24 % and 13.27 % of the variance among individuals, respectively. The variation among subpopulations accounted for only 4 % of the total variance, and variation within subpopulations accounted for 96 % (Table 1).

Table 1 AMOVA of the populations (pops)

Population structure

Three methods were used to determine the population structure. First, the genetic structure based on SSR markers was constructed by separating PCA plots, which revealed that the population was divided into 7 groups (Fig. 1). The results revealed that each group was relatively independent, but there was mutual fusion. The special characteristics in each cotton region were formed due to the unique climate and geographical ecological environment. For example, the early-medium maturity cotton varieties cultivated in dense planting were more suitable for the NIR and the NSEMR. Additionally, varieties in each cotton region, which were exchanged with each other, formed the same pedigree source. For example, as summarized in Additional file 3: Table S2, ZY10 served as a parent for ZY478 (NIR), ZY459 and ZY303 (YRR), ZY398 (YtRR). Thus, the cultivars in different cotton regions exchanged and maintained a relatively open system.

Secondly, based on Nei’s genetic distance, the population formed 7 distinct groups in the unrooted tree (Additional file 4: Figure S2a), including 106, 19, 147, 67, 9, 103, and 52 cultivars in Groups I to VII, respectively. Combining the genealogical, geographical and ecological distribution, each group was composed of cultivars from different sources but was dominated by cultivars from the same cotton area (Additional file 4: Figure S2b).

Thirdly, the population structure was analyzed using STRUCTURE software. The LnP(D) value increased continuously as K increased, and no plateau or obvious inflexion point was reached in this panel (Fig. 2a). As shown in Fig. 2b, although the ΔK value decreased rapidly from K = 2 to K = 5, K = 7 represented the first peak (upward inflexion point), indicating that the population structure could be divided into 7 subgroups. The 7 subgroups included 79, 141, 28, 20, 225, 6 and 4 cultivars (Fig. 2c). Thus, based on the three clustering methods, this population should be classified into 7 subpopulations.
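As an illustration of how a ΔK curve such as the one in Fig. 2b is typically derived from replicate STRUCTURE runs, the sketch below uses the widely used second-order rate-of-change statistic, ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd(L(K)). That this exact formula was applied here is an assumption, and the LnP(D) values are invented for the example.

```python
from statistics import mean, stdev


def delta_k(lnpd_runs):
    """Compute delta-K for every K that has both neighbours K-1 and K+1.

    lnpd_runs maps K -> list of LnP(D) values from replicate runs at that K.
    """
    avg = {k: mean(v) for k, v in lnpd_runs.items()}
    result = {}
    for k in sorted(lnpd_runs):
        if k - 1 in avg and k + 1 in avg:
            sd = stdev(lnpd_runs[k])
            if sd > 0:  # delta-K is undefined when replicates are identical
                second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
                result[k] = second_diff / sd
    return result


# Hypothetical replicate LnP(D) values (not data from the study).
runs = {1: [-100, -102], 2: [-80, -82], 3: [-75, -77], 4: [-74, -76]}
for k, value in delta_k(runs).items():
    print(k, round(value, 3))
```

The first and last K tested have no ΔK value, which is why a peak can only be read off for the interior of the tested range.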

Linkage disequilibrium

The linkage disequilibrium (LD) of this population was analyzed using 179 SSR markers. In a total of 11628 pairwise comparisons of 426 polymorphic SSR marker loci, 27.71 %, 17.26 % and 14.51 % of SSR marker locus pairs demonstrated significant LD at P < 0.05, P < 0.01 and P < 0.005, respectively. Based on r² estimates, only 2.09 % (r² ≥ 0.05) and 1.30 % (r² ≥ 0.1) of the marker pairs showed significant LD. In addition, LD was unevenly distributed among chromosomes, with loci of higher LD level concentrated on chromosomes 01, 02, 15, 19, 21, 24 and 26 (Fig. 3).

To characterize genome-wide LD decay, the r² and D’ values of LD were plotted as a function of genetic distance in cM. Significant pairwise LD (r² ≥ 0.05) was observed between some SSR locus pairs within a distance of 50 cM. At r² ≥ 0.018, genome-wide LD decayed rapidly within 0–25 cM (Additional file 5: Figure S3a). At r² < 0.03 (Additional file 5: Figure S3b) and D’ = 0.25 (Additional file 5: Figure S3c), genome-wide LD decayed within 0–5 cM, revealing potential for association mapping.
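The two LD measures used above, r² and D’, can be sketched for the simplest case of two biallelic loci with known haplotypes. This is a simplification for illustration only: SSR loci are often multi-allelic, the study's LD values came from its own analysis software, and the haplotypes below are hypothetical.

```python
def ld_stats(haplotypes):
    """Return (r2, d_prime) for two biallelic loci.

    haplotypes is a list of (a, b) tuples, where a and b are 0 or 1 and
    indicate whether the reference allele is present at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")

    r2 = d * d / denom
    # D' rescales |D| by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


# Hypothetical haplotypes: complete association versus independence.
print(ld_stats([(1, 1), (1, 1), (0, 0), (0, 0)]))
print(ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

Plotting such pairwise values against the map distance between the two loci gives a decay curve of the kind described above.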

Phenotypic variation of fiber quality traits

The phenotypic data (Additional file 6) of fiber quality in eight environments were processed by best linear unbiased prediction (BLUP), and the resulting breeding value of each cultivar for six fiber quality traits was used for association analysis. The cotton cultivars from seven cotton ecological regions in this study represented a broad variation at each experiment site. The highest coefficient of variation in FUHML (5.410 %) and FU (1.098 %) was found in cultivars from the Soviet Union; in FS (6.168 %) and MV (6.394 %), in cultivars from the NIR; and in SF (10.201 %) and FE (10.669 %), in cultivars from the NSEMR and the United States, respectively. Across traits, the highest coefficient of variation was observed in FE (9.31 %) and the lowest in FU (0.80 %). The heritability was higher in FUHML and FE (0.93 and 0.91) and ranged from 0.84 to 0.88 in the other five traits (Additional file 7).
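The coefficient of variation quoted above is the sample standard deviation expressed as a percentage of the mean. A minimal sketch follows, assuming the sample (n − 1) standard deviation and using hypothetical fiber length values rather than data from the study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the coefficient of variation in percent (sample SD / mean * 100)."""
    m = mean(values)
    if m == 0:
        raise ValueError("the mean must be non-zero")
    return stdev(values) / m * 100.0


# Hypothetical fiber length values in mm for three cultivars.
print(round(coefficient_of_variation([28.0, 30.0, 32.0]), 3))
```

Because the statistic is scaled by the mean, it allows traits measured in different units, such as FUHML in mm and FU in percent, to be compared on one scale.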

The correlations among the six fiber quality traits, based on the BLUP results, are listed in Additional file 8, and highly significant correlations were observed among the six traits. There were positive correlations between FUHML and FS and FU and between MV and FE. There were negative correlations between FUHML and MV, FE, and SF; FS and FE and SF; and FU and SF and FE.

The phenotype trends of fiber quality are shown in Fig. 4. FUHML (Fig. 4a), FS (Fig. 4b), MV (Fig. 4c) and FE (Fig. 4f) had relatively stable changing trends in eight environments. The trait changing trends of FU (Fig. 4d) and SF (Fig. 4e) were less stable in the eight environments. For instance, in 2012 and 2013, the means of FU in Kuerle were 84.51 % and 85.91 %, respectively, with increasing trends, whereas they were 84.85 % and 84.01 %, in Huanggang, with decreasing trends (Fig. 4d).

Pairwise correlations between environments were calculated among the eight environments for the six fiber quality traits (Fig. 5). Among the six fiber quality traits, the correlation means were ordered as FUHML (0.593) > FE (0.581) > FS (0.474) > MV (0.445) > FU (0.410) > SF (0.380). It is more informative to examine one trait across pairs of environments; taking FUHML as an example (Fig. 5a), the correlations ranged from 0.40 to 0.76, and the correlation was 0.76 for FUHML_12KEL and FUHML_13KEL. The red line, which lay close to 45° and had the greatest slope, reflected the correlation between FUHML_12KEL and FUHML_13KEL.
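The per-trait correlation means reported above can be reproduced in outline by averaging a correlation coefficient over all pairs of environments. The sketch below assumes Pearson correlation, which the text does not specify, and uses made-up environment labels and values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(trait_by_env):
    """Average the correlation over every pair of environments.

    trait_by_env maps an environment label to the trait values of the same
    cultivars, listed in the same order in every environment.
    """
    pairs = list(combinations(sorted(trait_by_env), 2))
    total = sum(pearson(trait_by_env[a], trait_by_env[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical trait values for four cultivars in three environments.
data = {"E1": [1, 2, 3, 4], "E2": [2, 4, 6, 8], "E3": [4, 3, 2, 1]}
print(mean_pairwise_correlation(data))
```

With eight environments there are 28 such pairs per trait, so a single mean can hide considerable spread, as the 0.40 to 0.76 range for FUHML shows.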

Association mapping of fiber quality-related traits

Based on the genotype data, the PCA matrix, the kinship matrix, and the BLUP values of the fiber quality traits in 8 environments, a mixed linear model was used to analyze the marker-trait associations. Three models, GLM (P + G) + Q, GLM (P + G) + PCs, and MLM (G + P + Q + K), were compared in the association analysis (Additional file 9). The three models controlled the effect of population structure similarly for FUHML, FS, SF, and FE. However, the MV analysis using the MLM-Q-K model was superior to that of the other two models, and the FU analysis using the GLM-Q and MLM-Q-K models was better than that using the GLM-PCA model. According to these comparisons, the MLM-Q-K model performed best.

A total of 179 SSR markers were used for marker-trait association after filtering for 5 % minimum alleles, among which 91 (50.84 %) markers were associated with fiber quality traits at the P < 0.05 level. Fifteen markers were significantly associated at the P < 0.01 level (Fig. 6). An average of 3.5 markers was detected on each chromosome (ranging from 1 to 8), with the maximum of 8 markers on Chr01 and Chr19. One marker was generally associated with several traits. For example, MON_DC40013 on Chr01 was related to FUHML, SF, FU and FS, and NAU2564 on Chr07 was related to FUHML, SF and FS.

There were 216 loci associated with fiber quality components at the P < 0.05 significance level, among which 27 were significant at the P < 0.01 level. The phenotypic variation explained (PVE) ranged from 0.58 % (MON_DPL0042b) to 5.12 % (NAU3084c), with an average of 2.70 % (Additional file 10).
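To make the PVE figures concrete, the sketch below computes the share of phenotypic variance explained by a single marker allele as the R² of a simple regression of the trait on allele presence. This is a deliberate simplification: the study used a mixed linear model that also accounts for population structure (Q) and kinship (K), which this sketch ignores, and the genotypes and trait values are hypothetical.

```python
def pve_percent(allele_present, phenotype):
    """R^2 (in percent) of regressing phenotype on a 0/1 allele indicator."""
    n = len(phenotype)
    mx = sum(allele_present) / n
    my = sum(phenotype) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(allele_present, phenotype))
    sxx = sum((x - mx) ** 2 for x in allele_present)
    syy = sum((y - my) ** 2 for y in phenotype)
    if sxx == 0 or syy == 0:
        raise ValueError("marker and phenotype must both vary")
    # For a single predictor, R^2 equals the squared Pearson correlation.
    return 100.0 * sxy * sxy / (sxx * syy)


# Hypothetical data: allele presence in six cultivars and their fiber strength.
allele = [1, 1, 1, 0, 0, 0]
strength = [30.1, 29.4, 30.8, 29.0, 29.9, 28.7]
print(round(pve_percent(allele, strength), 2))
```

Without the Q and K terms, a marker that merely tracks subpopulation membership would show an inflated value, which is the reason for comparing models as described above.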

Among the 6 traits, FS was associated with the most loci, up to a maximum of 61 (P < 0.05) and 7 (P < 0.01); PVE ranged from 0.58 (MON-DPL0042b, P < 0.05) to 3.17 % (NAU2836a, P < 0.001), with a mean of 2.63 %. The remarkable contribution loci were NAU2858a (2.92 %), BNL3089a (2.86 %) and MONCGR5399c (2.73 %), especially at the P < 0.001 level.

FUHML was associated with the second-largest number of loci, up to a maximum of 46 (P < 0.05) and 4 (P < 0.01); PVE ranged from 0.59 (NAU3092a, P < 0.05) to 3.11 % (NAU5480b, P < 0.001), with a mean of 2.35 %, in which NAU5480a (2.75 %) had a significant contribution at P < 0.001.

There were up to 42 (P < 0.05) and 6 (P < 0.01) loci associated with SF, and PVEs ranged from 0.68 (HAU2835b, P < 0.05) to 5.12 % (NAU3084c, P < 0.001), with a mean of 3.80 %. NAU3084b (5.08 %), MON-DPL0893a (4.43 %), and MON-DC40013b (3.03 %) contributed prominently to SF at P < 0.001.

There were 25 (P < 0.05) and 3 (P < 0.01) loci associated with FE, with PVE ranging from 0.70 (MON-CGR5423b, P < 0.05) to 2.99 % (HAU4806-SSCPa, P < 0.001). BNL846b and MON_CGR5113b contributed to FE at P < 0.01.

There were 23 (only at P < 0.05) loci associated with MV, with PVE ranging from 0.66 (NAU3138b) to 1.46 % (DPL0457a); NBRI-HQ524733b contributing to MV was detected at P < 0.05.

There were 19 (P < 0.05) and 7 (P < 0.01) loci associated with FU, and the PVEs ranged from 1.06 (HAU1279d) to 2.51 % (HAU1166a), with a mean of 2.14 %. The contribution loci were observed in MON-CGR5602a (2.39 %), MON-DPL0893a (2.26 %), MON-DC40013b (2.21 %), NAU3084c (1.93 %), HAU1166b (1.87 %) and NAU3084b (1.82 %) at P < 0.01.

Exploring elite allele-related genes in the cotton genome

The reference sequences of 91 elite allele loci associated with fiber quality traits were explored based on related genes in G. arboreum, G. raimondii and G. hirsutum.

Three allelic variation loci were related to gene functional annotation of fiber quality traits in G. arboreum (Additional file 11). HAU0211 was associated with FU and SF on Chr12; its homologous genes in G. arboreum and Arabidopsis thaliana were Cotton_A_01461 and AT5G16560.1, respectively, which were annotated as members of the Homeodomain-like HD-ZIP family, with the function of promoting cotton fiber elongation and initiation. HAU1355 was associated with FUHML, FS and SF on Chr18; its homologous genes in G. arboreum and Arabidopsis thaliana were Cotton_A_16285 and AT4G32551.2, respectively, which were annotated as members of the WD40 repeat-like-containing domain family, with the function of promoting fiber epidermal cell initiation. MON-CGR5167 was associated with FUHML and FS on Chr11; its homologous genes in G. arboreum and Arabidopsis thaliana were Cotton_A_07705 and AT4G00050.1, respectively, which were annotated as members of the basic helix-loop-helix (bHLH) DNA-binding superfamily, which promotes fiber epidermal cell initiation.

Five allelic variation loci were related to the gene functional annotation of fiber quality traits in G. raimondii (Additional file 12). The homologous genes of BNL3436, associated with FS on Chr25, in G. raimondii and Arabidopsis thaliana were Cotton_A_07705 and AT4G00050.1, respectively, with the gene annotation of UDP-glycosyltransferase 73B4, which is involved in cell wall synthesis and fiber development. HAU0211 and HAU1355 were discovered with the same gene functional annotation as in G. arboreum. NAU2564 was associated with FUHML, SF and FS on Chr07 with the same gene as MON_CGR5167 in G. arboreum. The homologous genes of NAU6627, associated with FS on Chr21, in G. raimondii and Arabidopsis thaliana were Gorai.007G150900 and AT5G43900.1, respectively, which have gene function related to Myosin 2 and may be connected with the cytoskeleton.

Four allelic variation loci were related to the gene functional annotation of fiber quality traits in G. hirsutum (Additional file 13). HAU1355, MON_CGR5167 and HAU0211 had the same gene annotation as described above. STV106, with the homologous genes Gh_A06G0097 and AT1G65910.1 in G. hirsutum and Arabidopsis thaliana, respectively, was a newly discovered allelic variation locus associated with FUHML, MV, FU, SF and FE on Chr06; its gene annotation was NAC domain containing protein 28, which thickens the secondary wall in Arabidopsis thaliana and the xylem and cell wall in cotton.

Discovery of superior alleles and typical materials

According to the genotype data of the loci associated with fiber quality-related traits identified at P < 0.05 and the phenotype data of the BLUP results of 6 fiber quality-related traits in 8 environments, 48 materials with superior alleles were discovered (Additional file 14). Taking FUHML as an example, 15 marker loci of positive phenotypic effects and 10 marker loci of negative phenotypic effects were found, with BLUP values ranging from 30.0 to 31.48 mm and from 22.87 to 25.98 mm, respectively. NAU1982a was the allelic variation locus with the maximum positive phenotypic effect (+0.473 mm) in ZY495; meanwhile, MON-CGR6378c was the allelic variation locus with the maximum negative phenotypic effect (−1.23 mm) in ZY83.
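The idea of a phenotypic effect per allele and a typical carrier material can be sketched as follows. The sketch assumes that the effect is the mean BLUP value of cultivars carrying the allele minus the mean of cultivars not carrying it, and that the typical material is the carrier with the most extreme BLUP value; this section does not give the exact formula, so both choices are assumptions, and the cultivar names and values are hypothetical.

```python
from statistics import mean


def allele_effect(blup, carriers):
    """Mean BLUP of carriers minus mean BLUP of non-carriers.

    blup maps cultivar name -> BLUP value; carriers is a set of names.
    """
    with_allele = [v for name, v in blup.items() if name in carriers]
    without_allele = [v for name, v in blup.items() if name not in carriers]
    if not with_allele or not without_allele:
        raise ValueError("need at least one carrier and one non-carrier")
    return mean(with_allele) - mean(without_allele)


def typical_material(blup, carriers, positive=True):
    """Carrier with the highest (or lowest, if positive=False) BLUP value."""
    pool = {name: blup[name] for name in carriers if name in blup}
    pick = max if positive else min
    return pick(pool, key=pool.get)


# Hypothetical fiber length BLUP values in mm for four cultivars.
blup = {"cv_A": 31.0, "cv_B": 30.0, "cv_C": 27.0, "cv_D": 28.0}
carriers = {"cv_A", "cv_B"}
print(allele_effect(blup, carriers), typical_material(blup, carriers))
```

A positive effect marks a candidate elite allele for traits where larger values are desirable, such as fiber length, whereas for a trait such as short fiber content the favorable direction would be reversed.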

Discussion

Population construction

The population panel consisted of 503 cultivars, including some basic germplasms introduced from abroad that evolved through three variety replacement stages (King, Trice and Lone star were introduced to the Northern Cotton Regions in the 1920s and partially replaced G. arboreum varieties; Stoneville4, Delfos531 and DPL14 replaced half of the G. arboreum varieties in the 1930s and 1940s; DPL15, Stoneville2B and Stoneville5A replaced the long-planted G. arboreum varieties and outdated G. hirsutum varieties in the 1950s) [26], and varieties bred in China from 1911 to 2011. Compared with the populations used in previous studies [24, 27], our population was more comprehensive and, with more than 500 cultivars, was sufficient for statistical power [22]. The population panel included cultivars from five main representative cultivated cotton regions and was thus enriched with abundant variations in yield, fiber quality and disease resistance. In this study, the evaluation of six fiber quality-related traits in eight environments showed wide variations (0.80 % to 9.31 %), stable heritability (0.84 to 0.93) (Additional file 7), and stable changing trends of each trait in different environments (Fig. 4). Phenotypic trait analysis based on the BLUP results ruled out environmental effects and improved the accuracy of the complex quantitative traits. Both the composition of the population and the trait evaluation indicated that this population panel could be considered an ideal resource for association mapping of quantitative traits in G. hirsutum.