Background

Cotton is one of the most important cash crops, serving both as the leading natural fiber resource for the textile industry and as an important oilseed crop. The genus Gossypium contains approximately 50 species, of which only 4 are cultivated worldwide: 2 diploids (G. herbaceum and G. arboreum) and 2 tetraploids (G. hirsutum and G. barbadense). The two tetraploid (2n = 4x = 52) species share common progenitors and arose from a natural hybridization between an A-genome and a D-genome species 1–2 million years ago [1,2,3]. G. hirsutum (Gh), known as Upland cotton, contributes over 95% of cotton fiber yield owing to its wide adaptation and high yield [4, 5]. Because of the long process of domestication and selection bottlenecks, elite Upland cotton has a narrow genetic base and limited genetic diversity [3]. This limitation could be a serious obstacle to improving fiber quality and maintaining continued genetic improvement [4]. In contrast, G. barbadense (Gb), also known as Sea-island cotton or extra-long staple cotton, has excellent fiber quality and disease resistance but lower yield [6]. Introgressing favorable Gb alleles into Upland cotton would make full use of the high productivity of Upland cotton and is an attractive strategy for cotton breeding [7, 8]. Although the two genomes are partly homologous [9, 10], interspecific breeding in cotton has had limited success [6, 11]. Identifying, cloning, and utilizing beneficial alleles from Gb is therefore important.

Primary segregating populations, such as F2 and BC1, have been widely used for genetic map construction and quantitative trait loci (QTL) mapping. However, their temporary nature and their large error in estimating small-effect QTL limit their use in complex QTL analysis and cloning [12, 13]. In recent years, chromosome segment substitution lines (CSSLs), also referred to as introgression lines (ILs), which are produced by crossing and backcrossing donor and recipient parents under marker-assisted selection (MAS), have provided a useful approach for dissecting complex genomes and mapping QTL [8]. Each CSSL carries one or a few homozygous chromosome segments of the donor genotype in the genetic background of the recurrent parent [14], combining the advantages of near-isogenic lines and backcross inbred lines. Because CSSLs can be planted repeatedly in different locations and years, they improve the resolution with which genetic effects in interspecific genomes can be estimated [15,16,17,18]. Since the pioneering work in tomato [19], interspecific introgression line libraries have been produced in many crops [20, 21]. Many QTL have been identified with traditional molecular markers, such as restriction fragment length polymorphism (RFLP), amplified fragment length polymorphism (AFLP) and simple sequence repeat (SSR) markers. However, limited by low genetic diversity and low genetic map density, these markers identify only a few QTL, and those QTL span wide genomic regions, which reduces their direct application in breeding [22, 23].

In recent years, whole-genome re-sequencing has been widely used in population genetic analysis [24,25,26]. High-throughput SNP genotyping platforms have greatly accelerated genetic mapping and QTL identification [27,28,29]. Compared with the low density of traditional molecular markers, SNP markers significantly improve genome coverage and QTL mapping accuracy. Multiple novel QTL for important agronomic traits have been identified in multiple crops [30,31,32]. Moreover, high-resolution SNPs are a versatile tool for characterizing the relationships between genes and important agronomic traits [33].

The prospect of widening the genetic diversity and improving the fiber quality of Upland cotton by accessing exogenous genes has encouraged interspecific hybridization and introgression efforts for many years [6]. The outstanding fiber quality of Gb has promoted its wide use in interspecific hybridization. Benefiting from the wide range of variation in the progeny of Gh × Gb populations, a large number of QTL related to multiple traits have been identified (https://www.cottonQTLdb.org). Moreover, some genes controlling specific characteristics of Gb have been fine-mapped or cloned, such as those for open-bud floral buds [34], okra leaf [35,36,37], and the naked seed mutant [38, 39]. Other wild Gossypium gene pools also provide broad genetic diversity for Upland cotton [40,41,42]. However, none of these studies used high-throughput sequencing for analysis, partly because neither an ultra-high-density genetic map covering the entire genome nor a high-quality tetraploid cotton reference genome was publicly available. In the last few years, our lab has removed both obstacles [10].

Here, a set of interspecific CSSLs derived from a cross between G. hirsutum cv. ‘Emian22’ and G. barbadense acc. 3–79 was developed by marker-assisted selection. All the lines and their parents were then re-genotyped by next-generation re-sequencing. The CSSLs were evaluated using PCR-based markers and high-quality SNPs, resulting in a total of 480 introgression segments and 1211 recombination bins, respectively. Fourteen important agronomic traits, including yield, fiber quality and oil content traits, were measured in five environments to detect QTL. The influence of the Gb chromosome segments in the Gh background was investigated.

Results

Evaluation of introgression chromosome recombination fragments in CSSLs

After several generations of self-pollination, 515 markers were selected to re-evaluate the locations of the donor introgression segments in the lines carrying multiple segments. Based on the marker genotypes and their physical locations, the lengths and locations of the introgression segments in each line were determined (Table 1), and a physical map was constructed (MM-map) (Fig. 1a). A total of 480 introgression segments were identified in the 325 CSSLs using SSR markers, ranging from 10 segments each on chromosomes A03, D02 and D04 to 30 segments on chromosome D11. Among the lines, 222 carried a single introgression segment, although the segment lengths differed, and 103 were classified into the multi-segment group (Additional file 1: Table S1).

Table 1 Comparison of genetic map and physical map in the CSSLs
Fig. 1

Distribution of introgression segments in the CSSLs on the 26 chromosomes. a Physical map constructed with SSR markers; b physical map constructed with whole-genome re-sequencing SNPs. Each row indicates a CSSL, and each column represents a chromosome. The black and red squares denote homozygous donor segments from Gb; the light-gray and green squares denote heterozygous segments from Gb; the grey background represents the genetic background of Gh

Based on SNPs from the sequencing data, 17,992 recombinant bins distributed on the 26 chromosomes were identified, from which 1211 recombinant chromosome introgression segments from Gb were ultimately constructed in the 313 CSSLs (Fig. 1b and Additional file 2: Table S2). No introgression segments were detected by SNPs in 10 lines of the CSSL population. The physical length of the introgression segments ranged from 97 kb to 104.23 Mb, with an average length of 4.43 Mb. In the physical map based on re-sequencing data (GR-map), the number of single segment substitution lines (SSSLs) was markedly reduced: only 54 lines carried a single donor segment, and the lines with fewer than four segments made up close to half of the population (Additional file 1: Table S1). The number of introgressions differed greatly within the Dt-subgenome, from 14 on D02 to 126 on D07 (Table 1).

Comparison of the genome coverage between SSR markers and SNPs

Based on the marker positions on the genetic map, the donor segments detected by SSR markers had a total length of 6175.33 cM and an effective coverage length of 3462.62 cM. Genome-wide coverage based on the genetic map was 78.42%, with a lower coverage ratio in the At-subgenome (73.73%) than in the Dt-subgenome (83.33%). Coverage was lowest on chromosome A07, at only 25.46%, and highest on chromosome D08 in the Dt-subgenome, which had no gaps (Table 1).

The total length of the donor segments in the physical map constructed with SNPs was 2.24 times the length of the cotton genome (Additional file 3: Table S3), with an effective coverage length of 1922.93 Mb and whole-genome coverage of 86.11%. In contrast to the MM-map, the GR-map had higher coverage in the At-subgenome than in the Dt-subgenome (89.48% vs 80.31%). Although coverage exceeded 90% on 16 chromosomes, 4 chromosomes still had coverage of less than 50%. Notably, chromosome A07 had the lowest coverage, consistent with the MM-map result, and more than 98 CSSLs carried the same segment on chromosome D07, located at 5.0–6.5 Mb.
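
The difference between the total donor length and the effective coverage length reported above can be illustrated with a minimal Python sketch. It assumes that the total length is the plain sum of all donor segment lengths across lines, while the effective coverage length counts overlapping segments only once (the union of the intervals). The function names and the toy coordinates are hypothetical and are not taken from the study.

```python
def merge_intervals(segments):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_stats(segments, chrom_length):
    """Summarise donor coverage of one chromosome.

    segments: (start, end) donor segments pooled over all lines, same unit
    as chrom_length (for example Mb or cM).
    """
    total = sum(end - start for start, end in segments)
    effective = sum(end - start for start, end in merge_intervals(segments))
    return {
        "total_length": total,            # overlaps counted repeatedly
        "effective_length": effective,    # overlaps counted once
        "fold": total / chrom_length,     # e.g. "2.24 times the genome"
        "coverage_pct": 100 * effective / chrom_length,
    }


# Hypothetical toy chromosome of length 200 with four donor segments.
segments = [(0, 40), (30, 70), (60, 90), (120, 150)]
stats = coverage_stats(segments, chrom_length=200)
print(stats)
```

In this toy case the segments sum to 140 units but cover only 120 distinct units, so fold coverage and percentage coverage diverge in the same way as the 2.24-fold total length and the 86.11% coverage reported for the GR-map.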

Phenotypic variation in CSSLs

Significant differences were observed between the parents for multiple traits across multiple environments, including seed cotton weight per boll (BWT), lint percentage (LP), seed oil content (SOC) and all fiber quality traits. Fourteen traits were evaluated in five environments, except that SI was investigated in only two environments (Additional file 4: Table S4 and Additional file 5: Table S5), and all traits showed a continuous distribution in the CSSLs. The broad-sense heritability (H2) was lower than 50% for the yield-related traits, indicating that they were easily affected by the environment (Additional file 6: Table S6). The higher H2 values of LP (76%), fiber length (FL) (77%) and SOC (87%) indicated that these traits were more strongly affected by the associated genes from the Gb genome. The fiber quality of Gb was outstanding in all environments, whereas most of the CSSLs showed only mediocre fiber traits. Interestingly, recombination of the interspecific genomes also produced various fuzz fiber mutations with different densities and colors (Additional file 7: Figure S1). Line N29 produced a fuzz-less phenotype similar to that of Gb, as reported previously [10].

Positive and negative correlations between the evaluated traits were calculated (Table 2). Plant height (PH) and first fruit branch height (FFBH) were weakly correlated with each other and with the yield-related traits (BWT and LP), whereas significant correlations were observed among the fiber quality traits. Fiber length (FL) was significantly positively correlated with fiber strength (FS) and fiber uniformity (FU), and negatively correlated with micronaire value (MIC), fiber elongation (FEL), short fiber content (SFC) and fiber maturity (FM). Higher SI values followed the general negative correlation between yield and fiber quality, which may in turn increase SOC.

Table 2. Correlation coefficients of 14 traits in the CSSLs over 5 environments.

Genetic basis of the morphological mutation in the CSSLs

Although the donor parent 3–79, the genetic standard of Sea-island cotton, has undergone artificial selection, the tall plant type characteristic of Sea-island cotton still appeared in the CSSLs (Fig. 2a). An “open-bud” floral phenotype, with an exposed stigma and dead anthers, was found during flower development (Fig. 2b). The associated marker, BNL3479 on chromosome D13, was consistent with previous research (Additional file 8: Table S7) [34].

Fig. 2

Some CSSLs showing morphological variations. a Significant tall mutant plant, b Open-bud mutant with stamen necrosis, c Sub-okra leaf, d Comparison of the LMI1 gene structure in the CSSLs

Using the high resolution of the recombination segments, the sub-okra leaf trait, an iconic characteristic of Gb, was identified in the CSSLs. Two neighboring KNOTTED1-LIKE HOMEOBOX I transcription factor genes homologous to LATE MERISTEM IDENTITY1 (LMI1), Ghir_D01G021810.1 and Ghir_D01G021830.1, are located near 61.14 Mb on chromosome D01. An 8-bp deletion in the third exon of Ghir_D01G021810.1 matched the mutation reported previously (Fig. 2c and d) [37]. These examples show that high-throughput genotyping can confirm a known locus at single-gene resolution in this population.

QTL mapping yield-related and fiber quality traits in the CSSLs

To identify genetic loci from interspecific hybridization that are valuable for cotton breeding, QTL were mapped in these CSSLs. The covered genomic fragments were divided into 620 blocks, ranging from 29 kb to 69.47 Mb with an average of 3.12 Mb (Additional file 9: Table S8). A total of 64 QTL for 14 traits were mapped on 20 chromosomes, with 38 in the At-subgenome and 26 in the Dt-subgenome (Fig. 3 and Table 3). The phenotypic variation explained by each QTL ranged from 0.73 to 14.67%. There were 19 QTL for four yield-related traits (BN, BWT, LP and SI), and the favorable alleles were from the Gh background. All the QTL for BWT and LP had negative alleles from the Gh background, suggesting that Gh has been domesticated for high yield. However, two QTL for BN had positive alleles, indicating that Gb also has the potential to increase yield. A total of 28 QTL were detected for fiber quality traits, most of which (18/28) had positive alleles from Gb. Among these, the QTL for FL and FS co-localized completely, consistent with the significant correlation between the two traits. Eight QTL for MIC were detected on seven chromosomes, explaining 2.54 to 7.09% of the phenotypic variation. In contrast to FEL and FU, the positive alleles for SFC and FM were contributed by Gh. The poor fiber quality of the CSSLs indicated that genetic recession had occurred in the interspecific hybrids between Gh and Gb.

Fig. 3

Chromosomal distribution of QTL and WAF value. Colored bars show the location of QTL. Red and blue indicate the additive effects from Gh and Gb, respectively; white and grey represent no effect and gap, respectively

Table 3 Summary of the QTL in the CSSLs

Genetic recession in the CSSLs

Genetic recession is a widespread phenomenon in populations derived from distant hybridization. Improving fiber quality is one of the primary goals of cotton interspecific breeding. In this study, 7 lines with longer FL and 4 QTL for FL were identified in the CSSLs. Interestingly, two lines (N180 and R88) did not contain any of the QTL intervals, and two QTL intervals (on A01 and D06) did not appear in the lines with longer FL. The 13 fiber quality QTL previously identified in the single segment substitution lines (SSSLs) [10] were inconsistent with the QTL for the same traits in this study, except q-FLA02. We therefore designed a weighted mean of additive effects of fiber quality (WAF) value to analyze the source of the additive effects of minor-effect genetic loci. Based on the correlations among the fiber traits, the additive effect across the genome was calculated (Additional file 10: Table S9). As a result, the At-subgenome from Gh showed a higher additive contribution to fiber quality, while the D-subgenome from Gb showed the opposite result (Additional file 11: Table S10). In the Gb genome, more than 80% of the regions of chromosomes A012, D02 and D12 had an additive effect on fiber quality improvement (Fig. 3). In addition, there was no additive effect from Gb on chromosome D07. More than 90% of the regions of chromosome A11 showed an effect from Gh. Notably, the proportion of regions with no contribution to fiber quality was significantly higher in the At-subgenome than in the Dt-subgenome. In particular, on both chromosome A08 and A12, more than half of the regions, whether from Gb or Gh, contributed no effect to fiber improvement.

QTL mapping for SOC and substitution mapping of QTL locus q-SOCA01–1

SOC in Gb has received little attention, although it differed significantly from that of the recurrent parent ‘Emian22’. A total of 12 lines showed extremely significantly (p ≤ 0.001) and stably higher SOC than ‘Emian22’ (Additional file 12: Table S11), and 15 QTL related to SOC were detected using BLUPed data; of these QTL, 12 were characterized for the first time and only two had been reported previously in an interspecific population (Table 3) [43]. Fortunately, three SSSLs (N159, N160 and N161) contained the same block (block 3) on chromosome A01, providing excellent material for further research. Compared with another 7 lines including the parents, these three lines showed extremely significantly high SOC, similar to the donor parent (Fig. 4). In the associated interval (block 3 ≈ 1.08 Mb), there were 69 and 70 annotated genes in the Gh reference genome TM-1 and the Gb reference genome 3–79, respectively. A previous study showed that cottonseed oil accumulates rapidly at the middle-late stages (20 to 30 days post anthesis) [44]. Hence, we focused on the genes that showed gradient expression in ovules, with significantly higher expression levels than in other tissues (root, stem, leaf and fiber) [10]. Among these genes, Gene Ontology (GO) analysis indicated that only six were involved in fatty acid metabolic processes in both genomes (Additional file 13: Table S12). However, these oil-related genes were not significantly differentially expressed in ovules between Gh and Gb (Additional file 14: Figure S2). Intriguingly, another gene, Gbar_A01G002860.1, encoding a predicted mitochondrial pyruvate dehydrogenase kinase (mtPDK), showed higher expression than its homolog Ghir_A01G003150.1. However, Marillia et al. reported that seed-specific partial silencing of mtPDK resulted in increased storage lipid accumulation in developing seeds [45]. Hence, this gene may play an important role in storage lipid accumulation at the late developmental stage of cotton seeds.

Fig. 4

Substitution mapping of q-SOC-1 using the 9 introgression lines (ILs) on chromosome A01. a White and black represent the genotypes of ‘Emian22’ and 3–79, respectively. b Seed oil content values are shown for five environments; CSSL_Gh represents the background of Emian22 (including lines N75, N12, N49 and N145) and CSSL_Gb represents the background of 3–79 (including lines N159, N160 and N161). One-way ANOVA was used to compare two lines and Dunnett’s multiple comparison to compare multiple lines. *** indicates a significant difference at the 0.001 level

Discussion

Cotton is the most important cash crop and contributes more than 95% of natural textile fiber. Improving fiber quality by broadening the genetic basis of Upland cotton cultivars has become imperative. Constructing interspecific introgression lines can combine the superior fiber quality of Gb with the high yield of Gh and also provides an ideal strategy for dissecting the complex genome and mapping QTL. Several CSSLs with better agronomic traits than Gh were found in this study and can be directly applied to improve fiber quality or SOC in cotton breeding.

Development strategy of the cotton introgression lines

The ideal introgression library is a series of SSSLs whose introgression segments together cover the entire donor genome. In the absence of a high-quality reference genome sequence, the cost-effectiveness of PCR-based molecular markers makes them the first choice for tracking introgression segments. In this study, a high-density interspecific genetic map between Gh and Gb was constructed and updated. In the early generations, a few markers selected from the primary genetic map were used to survey introgressions; in the advanced generations, after the high-density linkage map had been updated, new markers were used to select only the targeted regions, which significantly reduced the workload during development of the IL population. However, false or missing segments cannot be avoided with this approach. As a result, alignment to the reference genome revealed wide gaps in the At-subgenome, especially on chromosomes A01, A02, A03 and A06 (Fig. 5). Non-collinear arrangement and clustering of the SSR markers on the physical map significantly reduced genome coverage. SSR markers were significantly clustered at both ends of multiple chromosomes, such as A02, A03, A06 and A08, which is consistent with the observation that many lines carried a long fragment detected by several consecutive markers.

Fig. 5

Comparison of genetic map and physical map in evaluating the CSSLs. Left and right show SSR markers and SNP markers position on the chromosome, respectively. Colors show the density of the SNPs. Marker’s position is linked by grey lines between two maps

Despite these limitations, the high-density linkage map constructed by our lab still showed a certain advantage in this study: several SSSLs identified by PCR-based molecular markers were confirmed by genome re-sequencing.

High-throughput genotyping technology provides highly reliable introgression

Given a high-quality reference genome sequence, whole-genome re-sequencing provides a way to survey genomic variation across the entire genome, which helps to improve the detection of donor segments. In this study, the CSSLs were genotyped by next-generation sequencing following the reference genome project [10], and an ultrahigh-quality physical map based on SNPs was constructed, making this a pioneering use of this strategy for genotyping CSSLs in cotton. As a result, many small segments were newly detected by sequencing, which significantly reduced the corresponding chromosome regions and the candidate confidence intervals for the associated traits. Some segments containing candidate genes could not be effectively assessed by SSR markers, even though these markers were closely linked with the target trait. For example, the sub-okra leaf shape gene was detected by whole-genome re-sequencing, whereas the MM-map showed only that a marker was associated with this trait. No introgression segments were detected by SNPs in 10 lines. The reason is that the introgression fragments identified by SSR markers in these lines are less than 100 kb in length and were therefore marked as ‘not available’ and filtered out. Besides homozygous introgressions, a number of heterozygous fragments were detected on chromosomes A01 and A08 even after a few rounds of self-fertilization. For example, line R28 carried a heterozygous fragment covering almost the entire chromosome A08, and line R126 carried extensive heterozygous fragments on different chromosomes, which may explain the colorful phenotype of its fuzz fiber (Additional file 7: Figure S1). Consistent with previous reports [46, 47], we speculate that this may be related to interspecific segregation distortion.

Based on the above results, we conclude that an ideal introgression population can be constructed with the following strategy: (1) PCR markers from a high-density genetic map are used to construct the primary introgression lines in the early generations to decrease the cost; (2) all the lines are genotyped by high-throughput re-sequencing to accurately identify the introgression segments; (3) the lines carrying more than one segment are further backcrossed to obtain SSSLs.

CSSLs constructed a platform for resolving the polygene hypothesis

Quantitative traits are usually regulated by multiple minor-effect genetic loci, which are modified by the genetic background and the external environment [48]. Different QTL for fiber traits were detected in the SSSLs [10] and in the whole set of lines (this study), indicating that the superior fiber quality of Gb is controlled by multiple genes dispersed on different chromosomes. Notable evidence in this study was that two CSSLs (N180 and R88) carried multiple donor fragments but did not contain the QTL, which means that the genetic effects of these introgression fragments were too small to be detected as major QTL. As in previous studies of introgression populations, we aimed to dissect the donor genome by MAS. However, this strategy may disrupt the genetic architecture of quantitative traits such as fiber quality, which are commonly regulated by multiple genes at different developmental stages [49]. The hundreds of genes highly expressed during fiber development also support this view [10]. These co-acting genes from the Gb donor were separated and dispersed in different lines, which interrupted the regulatory relationships between them. We therefore propose that having fewer introgression fragments, as in the SSSLs, may effectively block the interactions between different genetic backgrounds and between loci on different chromosomes, which facilitates the detection of minor-effect genetic loci [10]. Conversely, with more introgression segments and higher genomic coverage, especially with long fragments, noise and epistatic effects are effectively reduced, which improves the reliability of identifying major, stronger-effect loci that can be directly applied in breeding. Previous reports described similar conclusions only briefly [31, 50, 51]. However, correlations between phenotypes may indicate that complex quantitative traits are controlled by the same gene or by closely linked genes. Many fiber quality QTL were detected in the interval of block 59 in this study, indicating that a single major genetic locus for fiber quality still exists in the Gb genome. We therefore conclude that fiber quality in the Gb genome is controlled by the interaction of this major gene with minor-effect polygenic loci scattered on different chromosomes, and that future breeding for improved fiber quality should try to pyramid more beneficial factors.

Sea-island cotton as an excellent resource for improving cottonseed oil content

Cottonseed oil contains a large amount of unsaturated fatty acids [52]. Several lines with higher SOC were identified; given the high broad-sense heritability of SOC (87%), these lines could be used directly in breeding for oil improvement. Multiple QTL for SOC were detected on different chromosomes in this population, suggesting that a network of genes controls SOC in Gb. These results indicate that Sea-island cotton has high potential for improving the SOC of Gh. In this study, we predicted that a PDK gene may regulate SOC in Gb, which suggests that the growth advantages of Sea-island cotton may have a more positive influence on regulating other traits than those of Upland cotton. The complexity of the fatty acid metabolism pathway and the diversity of lipid compositions make it difficult to propose candidate genes within the confidence intervals. However, combining genomic annotation and variation with transcriptome and metabolome analyses should provide sufficient information on lipid biosynthesis to identify candidate genes in the future, an approach that has proved feasible [12, 53].

Conclusions

Plant breeding aims to integrate multiple desirable traits to obtain elite varieties. Introgression between different species is a key process for broadening the genetic basis of breeding materials. In this study, we developed a CSSL population carrying introgression segments from Gb in the Gh background. Whole-genome re-sequencing was applied to construct a high-quality physical map for each line, which located introgressions more accurately than the map constructed with SSR markers. A total of 64 QTL were mapped for 14 agronomic traits, and favorable Gb alleles for fiber quality were identified. Importantly, novel Gb alleles for increasing SOC were found. Our study not only offers guidance for future molecular breeding to increase fiber quality and SOC, but also provides a basis for fine-mapping and map-based cloning of genes for the genetic improvement of Upland cotton.

Methods

CSSLs development

In this study, ‘Emian22’ (G. hirsutum) and ‘3–79’ (G. barbadense) were used to develop the CSSLs. ‘Emian22’ is an Upland cotton cultivar from Hubei Province with high yield and moderate fiber quality. ‘3–79’ is a genetic and cytogenetic standard line for G. barbadense with superior fiber quality and high resistance to Verticillium wilt. Both are publicly available materials and have been maintained in our laboratory for nearly twenty years. The construction of this CSSL population has been briefly described previously [10]. In 2006, after four rounds of successive backcrossing, 254 genome-wide SSR markers were selected to survey 221 BC4 lines [54] (Additional file 15: Figure S3). The 82 BC4 plants covering the whole donor genome were selected for further backcrossing with ‘Emian22’, and some of these individuals were also self-pollinated to produce BC4F2. In 2007, target regions were genotyped with the corresponding polymorphic markers in 1686 individual plants, comprising 1028 BC4F2 and 658 BC5F1 plants. Of these, 302 individuals that carried fewer than five short chromosome segments and together possibly covered the donor genome were selected, including 128 individuals with only one donor segment (Additional file 16: Figure S4). In 2008, 515 markers selected from the updated high-density linkage map [55] were used to re-evaluate the plants. About 312 individuals were selected, of which 162 had fewer than three donor segments (Additional file 17: Figure S5). The plants with only one donor segment were self-pollinated to produce homozygous CSSLs, and the others were further backcrossed with ‘Emian22’ to produce advanced backcross generations. Similarly, in 2009, the corresponding polymorphic markers were used to identify the target segments in all the lines, including the self-pollinated lines. About 336 individuals containing the target region were selected, including 60 plants with only one donor segment (Additional file 18: Figure S6). In subsequent generations, the same steps were used to select plants with the target segments. By 2011, 337 individuals had been obtained, of which 279 had fewer than three target segments, including 151 with only one donor segment (Additional file 19: Figure S7). After two rounds of self-fertilization to ensure homozygosity, a set of 325 CSSLs including 177 SSSLs was ultimately obtained.

Phenotype evaluation

All the CSSLs and their parents were planted in two replicated plots at three locations authorized by the local governments: Huanggang (HG), Hubei Province, and Shihezi (SHZ), Xinjiang Province, in 2015; Shihezi in 2016; and Jingzhou (JZ), Hubei Province, and Shihezi in 2017. Field management essentially followed local agricultural practices. PH, FFBH and BN, together with plant morphology (leaf and flower), were evaluated at the blooming stage. Every year, twenty bolls per line were hand-harvested from the inner middle parts of the plants at maturity. The yield-related traits BWT, LP and SI were measured, and seven fiber quality traits were investigated: FL, FS, MIC, FU, FEL, SFC and FM. Seed phenotypes were scored by visual inspection, and at least 10 g of delinted seeds were used to measure SOC with a low-field pulsed nuclear magnetic resonance (NMR) analyzer, NM-12 (Niumai Analytical Instrument Corporation, China). Best linear unbiased predictions (BLUPs) and broad-sense heritability (H2) were estimated for each trait across all five environments in R. Pearson correlation coefficients between traits were calculated from the BLUPed data using SPSS 17.0 (SPSS Inc., Chicago, IL, USA).
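
The text does not state which formula was used for H2. As an illustration only, the sketch below uses the entry-mean estimator that is common for multi-environment trials, H2 = Vg / (Vg + Vge/e + Ve/(e·r)), where Vg, Vge and Ve are the genotype, genotype-by-environment and residual variance components, e is the number of environments and r the number of replicates. The estimator, the function name and the variance values are assumptions, not the authors' method or data.

```python
def broad_sense_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Entry-mean broad-sense heritability (assumed estimator, see lead-in).

    var_g  : genotypic variance component
    var_ge : genotype-by-environment variance component
    var_e  : residual (error) variance component
    n_env  : number of environments
    n_rep  : number of replicates per environment
    """
    phenotypic = var_g + var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / phenotypic


# Hypothetical variance components; 5 environments and 2 replicates match
# the trial layout described in the text.
h2 = broad_sense_heritability(var_g=4.0, var_ge=5.0, var_e=10.0,
                              n_env=5, n_rep=2)
print(f"H2 = {h2:.2%}")
```

The variance components themselves would come from a mixed model fitted to the plot data; that fitting step is outside this sketch.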

Estimating the introgression segments in CSSLs using SSR markers

Total genomic DNA of the CSSLs and their parents was extracted from fresh young leaves at the seedling stage using a modified CTAB method [56]. A total of 515 SSR markers selected from the high-density interspecific genetic map were used to genotype the CSSLs. The length of each Gb introgression segment was estimated from the graphical genotype of the markers. If a marker had the same genotype as the donor parent, the line was considered to carry a fragment introduced from the donor parent at that genetic position; otherwise, the genetic background was considered to be the same as the recipient parent. Segments flanked by two markers with genotypes DD, DR, and RR were considered to be 100, 50, and 0% donor type, respectively, where “D” and “R” represent the donor and recipient genotypes (Additional file 20: Figure S8). Thus, the length of an introgression segment was estimated as the total length of its DD intervals plus half the length of its two flanking DR intervals [31].
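
The estimate described above can be illustrated with a minimal Python sketch. It assumes that the markers of one chromosome are sorted by map position and that each marker is scored only as "D" (donor) or "R" (recipient); the function name, the input layout, and the example positions are hypothetical and are not taken from the study.

```python
from typing import Sequence


def introgression_length(positions: Sequence[float],
                         genotypes: Sequence[str]) -> float:
    """Estimate the donor length on one chromosome.

    positions: marker positions sorted in ascending order (e.g. in cM).
    genotypes: "D" (donor) or "R" (recipient) for each marker.
    Each interval between adjacent markers counts as 100% donor if it is
    flanked by D-D, 50% if flanked by D-R or R-D, and 0% if flanked by R-R.
    """
    if len(positions) != len(genotypes):
        raise ValueError("positions and genotypes must have the same length")

    total = 0.0
    for i in range(len(positions) - 1):
        interval = positions[i + 1] - positions[i]
        # Number of donor-type flanking markers for this interval: 0, 1 or 2.
        n_donor = (genotypes[i] == "D") + (genotypes[i + 1] == "D")
        total += interval * n_donor / 2
    return total


# Hypothetical example: five markers, donor genotype at the 2nd and 3rd.
example_positions = [0.0, 10.0, 25.0, 40.0, 60.0]
example_genotypes = ["R", "D", "D", "R", "R"]
example_length = introgression_length(example_positions, example_genotypes)
print(example_length)  # 15 (DD) + 5 + 7.5 (two DR halves) = 27.5
```

In this example the single DD interval contributes its full 15 units, and the two flanking DR intervals contribute half of 10 and half of 15 units.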

Identification of SNPs and introgression segments in the CSSLs

The CSSL population was grown in the field in Wuhan, China, in 2017. Leaf tissue was collected for genomic DNA extraction with the Plant Genome Extraction Kit (TIANGEN Biotech). The 177 SSSLs and the parents had already been sequenced by Wang et al. [10]. The other 145 CSSLs were sequenced on the same Illumina HiSeq platform with at least 6× coverage (paired-end 150 bp; Additional file 21: Table S13). In addition, the Gh parent line ‘Emian22’ was deep-sequenced at 60× coverage. For SNP re-calling, all clean reads were mapped to the G. hirsutum TM-1 reference genome using BWA version 0.7.10, and SNPs were called using GATK following the previously reported method [10].

Because the CSSLs may carry large introduced fragments between chromosome recombination breakpoints, a bin map can be a better strategy than genotyping consecutive individual SNPs. A slightly modified sliding-window approach [57] was applied to identify the donor segments from Gb (Additional file 22: Figure S9). First, a total of 11,653,661 SNPs (an average of 5.3 per kb) were detected between Gh and Gb and used to construct the bins. Then, the alleles at these SNPs in each CSSL were filtered against the SNPs of both parents, and only those matching the allele of one of the parents were retained. The genotype of each window was called with a window size of 50 kb and a step size of 5 kb, based on the ratio of parental SNP alleles in the window: if > 80% of the SNPs had the genotype of one parent, the window was called homozygous for that parent; otherwise, it was called heterozygous. Recombination breakpoints were determined and bins were constructed as described by Han et al. [57]. Regions shorter than 100 kb between two adjacent bins with the same genotype were assigned to the same bin, and bins shorter than 100 kb were filtered out. The recombinant donor chromosome segments of each CSSL were then constructed from the recombinant bins.
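
The window-calling rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline used in the study: SNPs are assumed to be already filtered, sorted by position, and labelled by parental origin ("D" for Gb, "R" for Gh); the function name, the labels, and the handling of windows without SNPs (returned as None) are hypothetical. Breakpoint determination and bin merging as in Han et al. [57] are not reproduced.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple

Window = Tuple[int, int, Optional[str]]


def call_windows(snp_pos: Sequence[int],
                 snp_origin: Sequence[str],
                 chrom_len: int,
                 window: int = 50_000,
                 step: int = 5_000,
                 threshold: float = 0.8) -> List[Window]:
    """Call a genotype for each sliding window [start, start + window).

    snp_pos: SNP positions in bp, sorted in ascending order.
    snp_origin: "D" (donor allele) or "R" (recipient allele) for each SNP.
    Returns (start, end, call) tuples, where call is "D" or "R" if more
    than `threshold` of the SNPs in the window come from that parent,
    "H" (heterozygous) otherwise, and None if the window has no SNPs.
    """
    calls: List[Window] = []
    last_start = max(chrom_len - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        lo = bisect_left(snp_pos, start)  # first SNP with position >= start
        hi = bisect_left(snp_pos, end)    # first SNP with position >= end
        n = hi - lo
        if n == 0:
            calls.append((start, end, None))  # assumption: leave uncalled
            continue
        donor = sum(1 for origin in snp_origin[lo:hi] if origin == "D")
        recipient = n - donor
        if donor > threshold * n:
            call = "D"
        elif recipient > threshold * n:
            call = "R"
        else:
            call = "H"
        calls.append((start, end, call))
    return calls


# Hypothetical toy data: 10 SNPs, the first five recipient, the last five donor.
toy_pos = list(range(0, 100, 10))
toy_origin = ["R"] * 5 + ["D"] * 5
toy_calls = call_windows(toy_pos, toy_origin, chrom_len=100, window=50, step=25)
print(toy_calls)
```

With the toy data, the first window contains only recipient alleles, the last only donor alleles, and the middle window spans the switch and is therefore called heterozygous. The default arguments correspond to the 50 kb window, 5 kb step, and 80% threshold given in the text.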

QTL mapping and weighted mean of additive effects for fiber quality evaluation

To identify QTL, the Gb introgression segments were divided into several non-overlapping blocks (Additional file 23: Figure S10), so that the chromosome regions shared between lines were as small as possible. The BLUP data from the five environments were used as the response variables for the 14 traits. QTL mapping and additive effect calculation were performed using the RSTEP-LRT-ADD mapping method in QTL IciMapping V4.0 [58]. The block interval was used as the QTL location, and QTL were named according to the recommendations for standard QTL nomenclature and reporting in the Rosaceae (2014). To obtain potential candidate genes, the annotated genes were subjected to Gene Ontology (GO) analysis, and the transcription profiles of different tissues of TM-1 and 3–79 were used as a reference [10].

Based on the QTL mapping results, the additive effects of all the fiber traits were calculated. The contribution of Gb to fiber quality in the Gh background was estimated using a weighted mean model. Based on the correlations between the fiber traits and their broad-sense heritability, the WAF model was described by the formula below, where t represents a fiber quality trait, Add_t is the additive effect of trait t for each block, r_t is the positive correlation coefficient, and H²_t is the broad-sense heritability of the trait. The distribution of WAF along the chromosomes was calculated based on the block intervals.

$$ WAF=\frac{\sum_t {Add}_t\,{r}_t\,{H}_t^2}{\sum_t {r}_t\,{H}_t^2} $$
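
As a worked illustration of the WAF formula for a single block, the following Python sketch computes the weighted mean from per-trait dictionaries. The function name and all numerical values are hypothetical and serve only to show the arithmetic; they are not results from this study.

```python
from typing import Dict


def weighted_additive_effect(add: Dict[str, float],
                             r: Dict[str, float],
                             h2: Dict[str, float]) -> float:
    """Compute WAF for one block.

    add: additive effect of each fiber quality trait in the block.
    r:   positive correlation coefficient used as the weight of each trait.
    h2:  broad-sense heritability of each trait.
    WAF = sum(add * r * h2) / sum(r * h2), summed over the traits.
    """
    traits = list(add)
    numerator = sum(add[t] * r[t] * h2[t] for t in traits)
    denominator = sum(r[t] * h2[t] for t in traits)
    if denominator == 0:
        raise ValueError("the weights r_t * H2_t sum to zero")
    return numerator / denominator


# Hypothetical values for two traits in one block.
example_add = {"FL": 1.0, "FS": 2.0}
example_r = {"FL": 0.5, "FS": 0.5}
example_h2 = {"FL": 0.8, "FS": 0.8}
example_waf = weighted_additive_effect(example_add, example_r, example_h2)
print(round(example_waf, 3))
```

Because both traits carry the same weight (0.5 × 0.8) in this example, the WAF reduces to the simple mean of the two additive effects; unequal correlations or heritabilities would shift it toward the more heavily weighted trait.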