Background

Bread wheat (Triticum aestivum L.), which is a widely cultivated cereal crop that is highly adaptable, provides approximately 21% of the total calories and 23% of protein in the human diet (www.fao.org/faostat/en). As a staple food for about 35–40% of the global population, wheat is a good source of nutrients and has unique gluten properties, making it useful for producing diverse food products [1]. The increasing global population and improvements in the standard of living for many people worldwide have forced breeders to continually aim to produce new high-quality and high-yielding wheat varieties [2].

Yield and quality are complex traits. Additionally, the limited genetic diversity of bread wheat has resulted in breeding bottlenecks, and the application of traditional breeding methods has led to gradual increases in wheat yield and quality [3]. Genome sequencing and high-throughput chip-based genotyping platforms are critical for clarifying the mechanisms regulating the wheat yield potential and quality as well as for enhancing breeding methods [4]. Several SNP arrays (e.g., 9 K, 35 K, 90 K, 660 K, and 820 K) have recently been developed. They have been used to analyze bi-parental populations and identify loci (QTLs) controlling yield- and quality-related traits [5,6,7,8,9]. However, traditional QTL mapping methods are usually based on specific characteristics of parental populations, and are time-consuming and laborious [10].

GWAS are common in breeding programs because they are more efficient and require less effort in analyzing complex traits under various environmental conditions than other research methods [11]. Specifically, GWASs have been useful for detecting yield-associated loci in wheat, including plant height (PH), kernel number per spike (KNPS) and thousand grain weight (TGW) [12, 13]. However, because of the need for many seeds and the substantial time required to assess some quality traits, there have been relatively few GWASs regarding wheat quality traits such as wet gluten content (WGC) and grain protein content (GPC) [14, 15]. Moreover, there are few reports describing a GWAS conducted to investigate lodging resistance, which is an important factor influencing wheat yield and quality.

For a GWAS, the size and diversity of the panel are important because a small panel and large linkage disequilibrium (LD) blocks may lead to the identification of false positive associations [16]. Regarding wheat, only a few GWAS for yield and quality traits have involved large natural populations and SNP chips. Furthermore, wheat has been cultivated in China for more than 4000 years and has now been cultivated in 10 major agro-ecological zones [17]. Due to the long evolutionary period, Chinese wheat germplasms have been artificially selected in different regions and have regional genetic characteristics [18, 19]. Accordingly, the objectives and requirements for improving wheat varieties differ considerably among these regions. Thus, we performed a GWAS of wheat yield and quality involving 543 representative bread wheat cultivars, including 531 Chinese wheat cultivars from 10 provinces, and a wheat 90 K SNP array following phenotypic analyses in six environments.

The aim of this study was to identify the stable loci and candidate genes significantly associated with wheat yield and quality. The results described herein may be useful for revealing the genetic basis of yield and quality. The corresponding SNP markers that were identified may ultimately facilitate the breeding of new high-quality and high-yielding wheat varieties.

Results

Phenotypic variation and correlation analysis

The phenotypic data for the 543 wheat lines characterized regarding growth- and development-related traits, yield-related traits, and quality-related traits in six environments are listed in Table S1. The phenotypic variations among genotypes were determined based on the heritability, range, mean, standard deviation, and the coefficient of variation. There were obvious variations for all traits, especially the coefficient of variation for thrust (TH) (49.28%) in the E5 environment. Table S2 provides the estimated correlation coefficients for this combined analysis. The broad-sense heritability (h2) for most traits was approximately 0.80, with the highest and lowest heritabilities detected for PH (0.92) and PET (0.70). Accordingly, most traits were stable and largely determined by genetic factors. The correlation coefficient was highest (0.970) between wet gluten content (WGC) and grain protein content (GPC), but was also relatively high between TGW and GPR (0.919), wet gluten content (WGC) and flour content (FC) (0.842).

Genome-wide association study

The 90 K wheat iSelect SNP array with 81,587 SNPs was used for genotyping. After a quality control step, 11,140 SNP markers remained for the association mapping [20]. A total of 270 significant SNP loci associated with yield and quality traits were identified (Table S3, Fig. S1). These SNPs were located on 21 chromosomes and accounted for 1.27–8.47% of the phenotypic variation. Moreover, 94, 139, and 37 SNPs were in the A, B, and D subgenomes, respectively (Table S3). Of these SNPs, 42 pleiotropic loci associated with two or more traits were detected on chromosomes 1B, 2A, 2B, 2D, 3A, 3B, 3D, 4B, 4D, 5B, 5D, 6B, 6D, 7A, and 7B based on the common loci (Table 1).

Table 1 SNP sites detected in two or more traits

Growth and development-related traits

A total of 28 significant SNP loci for the flag leaf length (FLL) were detected on chromosomes 1B, 1D, 2A, 2B, 2D, 3A, 3B, 4B, 5A, 5B, 5D, 6B, 6D, 7A, and 7B, accounting for 2.44–4.12% of the phenotypic variation. Regarding the flag leaf width (FLW), 33 significant SNP loci were detected on 13 chromosomes (1B, 2B, 2D, 3A, 3B, 4A, 4B, 5B, 5D, 6B, 6D, 7A, and 7B) and explained about 2.52–6.92% of the phenotypic variation. For the flag leaf area (FLA), the 40 significant SNP loci identified across six environments were detected on 16 chromosomes (1B, 2A, 2B, 2D, 3A, 3B, 3D, 4A, 4B, 4D, 5B, 5D, 6B, 6D, 7A, and 7B) and explained about 2.58–6.37% of the phenotypic variation. For the flag leaf angle (FA), 13 significant SNP loci were detected on nine chromosomes (1B, 2B, 3A, 3B, 4B, 4D, 5A, 6A, and 6D), accounting for about 2.06–3.35% of the phenotypic variation. Of the 68 SNPs identified for the flag leaf-associated traits, 11 pleiotropic loci were associated with three traits.

For the maximum tiller number (MTN), 37 significant SNP loci were detected on 14 chromosomes (1A, 1B, 2B, 2D, 3A, 3B, 4B, 4D, 5B, 5D, 6B, 6D, 7A, and 7B) and explained about 1.80–5.32% of the phenotypic variation. Five significant SNP loci for the heading date (HD) were distributed on chromosomes 2A and 5A, accounting for 2.73–5.94% of the phenotypic variation. Seven significant SNP loci for the mature period (MP) were detected on chromosomes 2B, 3B, 6A, and 7B and accounted for 2.70–3.75% of the phenotypic variation. The seven significant SNP loci for the grain-filling period (GFP) were detected on chromosomes 2A, 2B, 2D, and 7B and accounted for 1.37–2.99% of the phenotypic variation. Nine significant SNP loci for the grain-filling rate (GFR) were detected on chromosomes 1A, 1B, 2A, 2B, 4A, 4B, and 5B, accounting for 2.39–3.59% of the phenotypic variation. Eight significant SNP loci for TGW were detected on chromosomes 1B, 2B, 3A, 3B, 4B, and 6D, explaining 2.29–3.59% of the phenotypic variation.

For PH, 20 significant SNP loci were detected on 10 chromosomes (1A, 1B, 2B, 3A, 3B, 4A, 4B, 4D, 5A, and 6B) and explained about 1.96–4.42% of the phenotypic variation. For the diameter of the first internode (FD), 16 significant SNP loci were detected on nine chromosomes (1A, 1B, 2B, 3D, 4A, 4B, 5B, 6A, and 7A) and explained about 1.53–4.85% of the phenotypic variation. Regarding the length of the first internode (FIL), 17 significant SNP loci were detected on eight chromosomes (1A, 1B, 2A, 3A, 3B, 4D, 6B, and 7A) and explained about 2.13–5.76% of the phenotypic variation. For the diameter of the second internode (SD), the 18 significant SNP loci identified across six environments were detected on eight chromosomes (2A, 2B, 2D, 3B, 4B, 5B, 6B, and 7B), explaining about 2.15–5.02% of the phenotypic variation. For the length of the second internode (SIL), 15 significant SNP loci were detected on seven chromosomes (1A, 1D, 4B, 4D, 5B, 7A, and 7D) and explained about 2.57–5.74% of the phenotypic variation. For TH, 45 significant SNP loci were detected on 15 chromosomes (1B, 2B, 2D, 3A, 3B, 4A, 4B, 4D, 5B, 5D, 6A, 6B, 6D, 7A, and 7B), accounting for approximately 2.5–7.13% of the phenotypic variation. Among them, it is noteworthy that three loci on chromosome 4B are associated with FD and SD.

Yield-related traits

For spike length (SL), nine significant SNP loci were distributed on seven chromosomes (1B, 3A, 3D, 4B, 6D, 7A, and 7B) and explained about 2.42–4.33% of the phenotypic variation. Regarding the spikelet number per spike (SNS), 14 significant SNP loci were detected on six chromosomes (2A, 2B, 2D, 4B, 4D, and 7A), explaining about 1.27–4.02% of the phenotypic variation. The nine significant SNP loci for KNPS were detected on chromosomes 1A, 3A, 4A, 4B, and 5D and accounted for 2.58–4.32% of the phenotypic variation. Regarding the percentage of spike-bearing tillers (PET), 11 significant SNP loci were detected on 10 chromosomes (1A, 2A, 3A, 4D, 5B, 5D, 6B, 6D, 7A, and 7B) and explained about 2.74–3.55% of the phenotypic variation. For the spike number per mu (EPM), 17 significant SNP loci were detected on eight chromosomes (1B, 2A, 4B, 4D, 5A, 5B, 7A, and 7B) and explained about 2.33–8.47% of the phenotypic variation.

Quality-related traits

For the grain volume (GV), 17 significant SNP loci were detected on five chromosomes (1B, 3B, 3D, 6A, and 6B) and explained about 2.52–5.01% of the phenotypic variation. The two significant SNP loci for GPC detected on chromosomes 1A and 3D accounted for 2.35–2.53% of the phenotypic variation. Two significant SNP loci for WGC were detected on chromosomes 3D and 5D and accounted for 2.43–2.90% of the phenotypic variation. Ten significant SNP loci for the flour content (FC) were detected on chromosomes 1B, 4B, 5A, 5B, 6A, and 7B, accounting for 1.80–3.27% of the phenotypic variation.

Putative candidate gene analysis and expression data

In our study, the 200-, 380-, and 600-kb sequences flanking the related SNPs in subgenomes A, B, and D, respectively, were identified as potential candidate gene regions. A total of 638 putative candidate genes detected of the significant SNPs flanking-regions based on BLUP data were identified by screening the annotated genes in the recently released genome sequence (IWGSC RefSeq v1.0) (Table S4). We performed subsequent haplotype and expression analysis for the following three critical traits: SNS (Fig. 1), FD (Fig. 2), and GV (Fig. 3).

Fig. 1
figure 1

Genome-wide association study results for SNS and an analysis of the peak on chromosome 7A. a Manhattan plots for SNS. The horizontal line represents the significance threshold. The arrows indicate the location of the main peaks studied. b Genomic locations of four SNP loci and the LD based on paired R2 values between SNPs on chromosome 7A. c Haplotypes detected in 543 accessions based on the four SNPs. d Differences in the SNS among three haplotypes. e Structure of the TraesCS7A01G482000 gene. f Transcriptomic analysis of TraesCS7A01G482000 in the spike during different developmental stages. Gene expression was based on the fragments per kilobase of transcript per million mapped reads (FPKM) value

Fig. 2
figure 2

Genome-wide association study results for FD and an analysis of the peak on chromosome 4B. a Manhattan plots for FD. The horizontal line represents the significance threshold. The arrows indicate the location of the main peaks studied. b Genomic locations of four SNP loci and the LD based on paired R2 values between SNPs on chromosome 4B. c Haplotypes detected in 543 accessions based on the four SNPs. d Differences in the FD between two haplotypes. e Structure of the TraesCS4B01G343700 gene. f Transcriptomic analysis of TraesCS4B01G343700 in different tissues and stages. Gene expression was based on the fragments per kilobase of transcript per million mapped reads (FPKM) value

Fig. 3
figure 3

Genome-wide association study results for GV and an analysis of the peak on chromosome 6B. a Manhattan plots for GV. The horizontal line represents the significance threshold. The arrows indicate the location of the main peaks studied. b Genomic locations of five SNP loci and the LD based on paired R2 values between SNPs on chromosome 6B. c Haplotypes detected in 543 accessions based on the five SNPs. d Differences in the GV among three haplotypes. e Structure of the TraesCS6B01G295400 gene. f Transcriptomic analysis of TraesCS6B01G295400 in different tissues and stages. Gene expression was based on the fragments per kilobase of transcript per million mapped reads (FPKM) value

Regarding the SNPs associated with SNS, an association peak that included four significantly associated SNPs was detected on chromosome 7A (Fig. 1a, b). On the basis of the SNPs on chromosome 7A, three haplotypes were identified, with the mean SNS for haplotype II (19.57 ± 1.14) significantly lower than the corresponding values for haplotypes I (19.78 ± 1.33) and III (20.31 ± 1.43) (Fig. 1c, d). One of the four significant SNPs (BS00026622_51) was located at 942 bp in the third exon of the gene TraesCS7A01G482000. This locus was detected in four environments and in the BLUP model. SNP (C/A) at this location in the relevant genomic regions cause amino acids to change from leucine to isoleucine (Fig. 1e). The expression of the TraesCS7A01G482000 candidate gene when two nodes were detectable was significantly upregulated according to the RNA-seq analysis of the spike (Fig. 1f).

Among the SNPs associated with FD, four similar SNPs were distributed on chromosome 4B and all were detected in the BLUP model and in at least one environment (Fig. 2a). These four SNPs were BS00023035_51, Tdurum_contig48366_1324, Tdurum_contig50783_67, and Tdurum_contig50783_285, and four SNPs were separated 470.67 kb (Fig. 2b). Because of their close genetic relationship and significant correlation, these SNPs were used for the subsequent haplotype analysis, which revealed two distinct haplotypes (I and II). A total of 301 wheat materials were included haplotype I, with an FD of 3.25 ± 0.48 cm, whereas 208 wheat varieties were included haplotype II, with an FD of 3.17 ± 0.45 cm (Fig. 2c, d). The FD of haplotype I was significantly greater than that of haplotype II, implying lodging is less likely for wheat varieties with haplotype I than for varieties with haplotype II. Furthermore, three of the four significant SNPs were detected in the CDS of TraesCS4B01G343700. This gene contains two exons and the CDS comprises 2450 bp. The two SNPs located in the second exon resulted in amino acid changes from arginine to histidine and from aspartic acid to asparagine (Fig. 2e). The TraesCS4B01G343700 expression level increased in the stem and internode during wheat growth and development, peaking in the milk stage. This suggests that this gene helps mediate wheat internode growth and development (Fig. 2f).

The changes in GV were highly correlated with a set of SNPs in a 1.95 Mb genomic region (528.62–530.57 Mb) on chromosome 6B (Fig. 3a, b). These loci were detected in multiallelic and BLUP contexts. Three haplotypes were detected in 407 wheat lines on the basis of genotyping results. More specifically, haplotypes I, II, and III were detected in 339, 40, and 28 accessions, respectively. The mean GV of haplotype II (789.88 ± 8.58) was significantly lower than that of haplotypes I (798.50 ± 8.29) and III (799.19 ± 4.60) (Fig. 3c, d). Moreover, one of the four significant SNPs (Excalibur_c11245_880) was detected at 3315 bp in the TraesCS6B01G295400 coding sequence (CDS). The SNP (G/A) leads to changes in amino acids from arginine to histidine (Fig. 3e). The RNA-seq data revealed that this gene was highly expressed in developing seeds at 6 days post-anthesis (Fig. 3f).

Discussion

GWAS have been conducted to analyze the yield and quality of various crops, including rice [21], soybean [22], cotton [23], and sorghum [24]. The size of the study population is closely related to the accuracy of the association analysis. Long et al. [25] reported that increasing the population size increases the number of individuals with rare alleles, thereby increasing the accuracy and efficiency of the positioning of rare alleles. Therefore, our GWAS for yield- and quality-related traits involved a wheat 90 K SNP array for 543 wheat accessions in multiple environments.

We completed a broad-scale comparison of the results of our study and those of previous investigations. For PH, a locus (RAC875_rep_c105718_585) on chromosome 4D identified in four environments based on BLUP data is about one LD from a QTL (QPH.caas-4DS) described by Li et al. [26], This previously identified QTL was also stably identified in two Chinese bread wheat populations. A QTL (QKNS.caas-4AL) for KNPS between markers Kukri_rep_c106490_583 and RAC875_c29282_566 on chromosome 4A was detected earlier in four environments by Gao et al. [13]. Two sites (Excalibur_c9370_966 and wsnp_Ra_rep_c70233_67968353) related to KNPS revealed in our study are located within this QTL. Gao et al. also reported a stable QTL (QTKW.caas-4BS.1) for TGW between markers BobWhite_c162_145 and Kukri_c66885_230 on chromosome 4B. In the present investigation, a predicted TGW-associated SNP locus (Tdurum_contig97386_207) was detected between two markers on chromosome 4B. Additionally, TaAPO-A1/WAPO-A1 (TraesCS7A02G481600) was identified as a candidate gene for SNS through map-based cloning [27,28,29]. Two loci related to SNS (BS00026622_51 and RAC875_c19111_628) on chromosome 7A identified in four environments based on BLUP data are within one LD of TaAPO-A1. Dobrovolskaya et al. [30] cloned a WFZP (WHEAT FRIZZY PANICLE) gene related to SNS on chromosome 2D. This gene is within one LD of a stable SNS-related locus (GENE-0787_85) revealed in the current study. For the quality-related traits, our GPC locus (GENE-0411_807) overlaps the GPC QTL mapped to chromosome 1A by Kumar et al. [31]. Li et al. [32] identified a locus (IWB41869) related to starch granules on chromosome 7B, which is close to the FC locus (RAC875_c26057_370) we detected on the same chromosome.

In addition, we have also identified new stable SNPs that affect specific genes and have been detected in multiple environments. For example, TraesCS7A01G482000, which is related to SNS, was predicted to encode a haloacid dehalogenase (HAD)-like hydrolase domain-containing protein. In rice, the overexpression of OsHAD1 reportedly leads to enhanced phosphatase activity and increased total and soluble P contents in Pi-deficient transgenic seedlings during the early panicle development stage. Increasing the P uptake rate can promote spikelet formation [33, 34]. The RNA-seq data available in an online wheat gene expression database indicated this gene is highly expressed when two nodes are detectable, which coincides with a key period for wheat spikelet development.

Lodging resistance is another important yield-related trait. Four significant SNPs for internode diameter were identified on chromosome 4B. These SNPs were detected within TraesCS4B01G343700. In Arabidopsis, AtVPS25 (vacuolar protein sorting-associated protein) regulates auxin biosynthesis via its effects on the expression of specific auxin-related genes. An increase in the auxin content of wheat plants may lead to increased stalk diameters and enhanced lodging resistance [35,36,37]. A transcriptome-level analysis of TraesCS4B01G343700 in different tissues and developmental stages proved that this gene is highly expressed during the stem and internode development stage. The expression level peaked in the milk grain stage, implying the changes occurring in this stage have important implications for the lodging resistance of wheat plants.

Regarding the quality-related trait GV, we identified TraesCS6B01G295400, which encodes a LisH domain-like protein, as a candidate gene. In rice, OsLIS-L1 (lissencephaly type-1-like 1 protein) is important for male gametophyte formation, with mutations to this gene resulting in abnormal development. Additionally, the protein encoded by this gene influences grain characteristics and is closely related to the floral development and grain-filling stages [38]. An analysis of publicly available wheat RNA-seq data revealed that this gene is highly expressed in developing seeds, specifically in the starchy endosperm from day 6 to day 14 post-anthesis and showed a gradually decreasing trend. Accordingly, this gene is important for the early grain-filling stage.

Conclusions

In summary, we conducted a GWAS based on the wheat 90 K SNP array to investigate the yield- and quality-related traits in 543 major wheat accessions. The resulting data were analyzed to identify relevant SNP loci and candidate genes. We are currently developing Kompetitive Allele-Specific PCR markers for the significant loci. These markers will enable researchers and breeders to efficiently transfer alleles into elite wheat genotypes [39]. Additionally, a more thorough characterization of the candidate genes described herein may enhance our understanding of the molecular mechanisms regulating wheat yield and quality.

Methods

Plant materials and experimental design

A bread wheat panel of 543 genotypes including cultivars, regional test lines, and introduced parental lines was used, the details have been published in our previous paper [20]. During the two growing seasons, wheat plants grow in three places in Hebei Province. The locations were Baoding (115.5°48′E, 38°85′N), Cangzhou (116°80′E, 38°58′N), and Xingtai (118°9′E, 39°42′N). The six environments were designated as follows: 2016 Baoding (E1), 2016 Cangzhou (E2), 2016 Xingtai (E3), 2017 Baoding (E4), 2017 Cangzhou (E5), and 2017 Xingtai (E6). The field trial was completed using a completely randomized design. Each plot contained three 1.5 m rows with 0.25 m between rows. The plant spacing is about 2.5 cm. Wheat plants were cultivated following normal local practices.

Phenotypic evaluation

Twenty-five phenotypic traits were measured, including growth and development-related traits (FLL, FLW, FLA, FA, MTN, HD, MP, GFP, GFR, TGW, PH, FD, FIL, SD, SIL, and TH), yield-related traits (SL, SNS, KNPS, PET, and EPM), and quality-related traits (GV, GPC, WGC, and FC). The data recorded for each trait are summarized in Table S1. The phenotypic traits were assessed in all six environments. The phenotypic data for each environment and the BLUP data were used for the genome-wide association analysis.

Phenotypic data analysis

The descriptive statistical analysis and correlation analysis for the phenotypic data were completed using the SPSS 25.0 software. Pearson’s correlation coefficients were calculated to evaluate the correlations among the traits.

SNP genotyping

The wheat 90 K Illumina Infinium SNP array was used to genotype the association panel containing 543 accessions. The SNP data were clustered and automatically called using the Illumina BeadStudio genotyping software (Illumina, San Diego, CA, USA). The data were filtered to remove alleles with a detection rate less than 0.1 and a minor allele frequency less than 0.05 [12]. Additionally, samples with a loss rate greater than 10% and a heterozygosity frequency greater than 20% were eliminated.

Genome-wide association analysis

The population structure, relative kinship, and LD were analyzed in a previous study [20]. In the current study, we completed a GWAS using the GAPIT package [40] in the R software. A mixed linear model program (Q + K) [41], with the population stratification results and kinship as covariates, was used to minimize false positives [40]. The P value threshold was calculated based on the number of markers (P = 1/n, n = total number of SNPs used) as described by Li et al. [42]. Regarding the GWAS results, a P value of 1/11,140 (−log10P = 4.05) was used as the criterion for identifying significant SNPs.

Prediction of candidate genes and expression analysis

The ‘Chinese Spring’ Genome database (IWGSC RefSeq v1.0, http: //www.wheatgenome.org/) was used for predicting candidate genes for the significant sites revealed by the genome-wide association analysis. Specifically, candidate genes around the significant sites were identified according to the differences in the LD decay distance among chromosomal groups. The expression profiles of putative candidate genes were analyzed using a wheat gene expression database available online (http://www.wheat-expression.com/). This database, which includes 850 wheat RNA-sequencing samples and an annotated genome, reveals the similarities and differences between homoeolog expression levels in diverse tissues, developmental stages, and cultivars [33, 43].