Background

Sudden death syndrome (SDS) of soybean [Glycine max (L.) Merr.], caused by the soil-borne fungal pathogen Fusarium virguliforme[1], is a considerable threat to soybean production [2]. The fungus infects soybean root systems and produces toxins that are translocated to the leaves, resulting in premature defoliation and pod abortion [3, 4]. In recent years, SDS ranked among the top five most damaging diseases of soybean in the United States [5]. In the Midwestern soybean producing area of the U.S., it is estimated that SDS has resulted in average losses valued at $190 million a year [6].

Host plant resistance is believed to be the most effective control measure for SDS [7]. Since no soybean genotypes confer complete immunity to this disease, soybean breeders still rely on quantitative resistance to SDS [7, 8]. The wide range of variation of susceptibility to both leaf scorch and root rot also provides a great opportunity to improve SDS resistance through genetic manipulation [9]. However, most of what we know about the genetic architecture of SDS resistance is based on traditional quantitative trait locus (QTL) linkage mapping using bi-parental populations. Fourteen QTLs dispersed throughout the genome, underlying resistance to root infection, leaf scorch or both, have been confirmed in several bi-parental populations [10]. However, the large confidence intervals for those QTLs impair the precise identification of causative genes. To date, only one resistance gene (GmRLK18-1) has been tagged and cloned [11]. This gene is at 1.071 kbp on chromosome 18, within the major resistance QTL (Rfs2). Association mapping, which exploits historical recombination events at the population level, has become a powerful alternative to linkage mapping in the dissection of complex trait variation at the sequence level [12]. A more specific strategy, genome-wide association (GWA) mapping, is a powerful complementary strategy for classical bi-parental linkage mapping to dissect complex traits and has been used with success in Arabidopsis[13], rice [14] and maize [15, 16]. The use of association mapping in soybean was therefore desirable to improve the mapping of important traits in soybean. So far, only a few association mapping studies, with limited numbers of markers, have been reported for dissecting agronomic traits in soybean [17, 18]. To the best of our knowledge, GWA mapping has not yet been employed to study any traits related to soybean disease resistance. Recently, the availability of a soybean reference genome sequence and the development of high throughput SNP assays has enabled GWA mapping in soybean [19, 20]. A previous study reported that approximately 1% of the 6,037 Plant Introductions (PIs) from the USDA Soybean Germplasm Collection were partially resistant to SDS [21]. Therefore, conducting an association study in assembled PI collections might not be feasible. Furthermore, previous research indicated that most of the PIs showing resistance to soybean cyst nematode (SCN) were also partially resistant to SDS [22]. Therefore, association mapping with released elite cultivars is more likely to identify superior resistance alleles that have been captured and accumulated by SCN or SDS breeding practices.

The goal of this study was to investigate the genetic architecture of soybean SDS resistance in released elite soybean cultivars. Here we present the first experimental results of GWA mapping for SDS, across two independent panels of elite soybean cultivars, using a high-density customized oligonucleotide genotyping array. We detected 20 QTLs including known candidate genes (or QTLs) as well as new candidate loci in the soybean genome. The identification of these loci will increase our understanding of mechanisms underlying SDS resistance, and provide valuable markers for breeding soybean lines with SDS resistance.

Methods

Sampling and genotyping

Two independent experiments were conducted in this study. Experiment 1 was done with a mapping population of 392 diverse soybean cultivars (association panel P1), consisting of 251 varieties released between 2010 and 2012 and 141 advanced breeding lines from Michigan State University. Experiment 2 used a set of 300 diverse G. max advanced breeding lines (association panel P2) developed by public breeders. The germplasm was chosen to represent a range of materials developed for the U.S. North Central soybean production area. Further information about the P1 and P2 panels is given in Additional file 1.

Soybean genomic DNA was extracted from young leaf tissue following the previously described method [23]. All the accessions in panel P1 were genotyped using the Illumina SoySNP50k iSelect BeadChip (Illumina, San Diego, Calif. USA) which consists of 52,041 SNPs [20]. All the accessions in panel P2 were genotyped using the Illumina SoySNP6k iSelect BeadChip (Illumina, San Diego, Calif. USA), which consists of 5,361 SNPs [24]. The chromosomal distributions of the SNPs of SoySNP50K and SoySNP6k BeadChip are shown in Additional file 2. Genotypes were called using the program GenomeStudio (Illumina, San Diego, Calif. USA). The SNP data were coded according to the standard codes for nucleotides derived from the International Union of Pure and Applied Chemistry (IUPAC). The quality of each SNP was checked manually as previously reported [25]. SNPs without physical position information and with low quality (call rate < 80% and or minor allele frequency < 0.05) across all samples were removed from the dataset.

Field resistance evaluation

The association panel P1 was evaluated for SDS resistance in a naturally infested SDS disease nursery at Decatur, Michigan during the growing season (May-October) in 2011 and 2012, where consistent, natural and heavy SDS disease symptoms was observed on susceptible checks. Four replications per year were grown in a lattice design with four-row plots 6 meters long. The association panel P2 was divided into four groups based on the maturity group I to IV, and were evaluated for SDS resistance during the summers of 2011 and 2012 in 14 locations including Michigan (Decatur), Iowa (Kanawha and Ames), Minnesota (Waseca and Rosemount), Illinois (Manito, Streator, Fairbury, Beardstown, Urbana, Shawnee town and Valmeyer), Missouri (Sikeston) and Ontario, Canada (Harrow). Four replications per year were grown in a lattice design. SDS was evaluated by scoring disease incidence (DI) and disease severity (DS) at the R6 growth stage, the stage at which pods contain full-size green beans at one of the four uppermost nodes with a completely unrolled leaf [26]. SDS leaf scorch DI was rated from 0% (no disease) to 100% (all plants symptomatic), and DS was measured on a scale from 1 to 9 as described in Additional file 3 (after Bond, J. unpublished). The SDS disease index (DX, 0–100) was calculated as DI × DS/9. In panel P1, mean values of DI, DS and DX across replicates and years were used in association analysis throughout the study. The trait distribution for DX and DI was slightly skewed towards susceptible, thus a square root transformation was used to normalize the trait distribution prior to further analysis. The association panel P2 was phenotyped in multiple environments; best linear unbiased predictors (BLUPs) were used for the overall association analysis in panel P2. The BLUPs for each line were calculated with the R package, lme4, using the equation Y ijk  = Line k  + Environment i  + Replicate (Environment) ij  + (Line × Environment) ik  + ϵ ijk , where Y ijk is the observed phenotype for the kth line in the jth replicate of the ith environment; Linek is the random effect of the kth line; Environment i is the random effect of the ith environment; Replicate (Environment) ij is the random effect of the jth replicate in the ith environment; (Environment × Line) ik is the random interaction effect of the ith environment and the kth line, and ϵ ijk is the error term. Analysis of variance (ANOVA) for the phenotypic data was performed with the R package, lm(stats) and anova.lm(stats). The heritability estimates were calculated using variance components obtained by ANOVA [27].

Population genetic analysis

Principal components analysis and neighbor-joining (NJ) trees were applied to infer population stratification. A pairwise distance matrix derived from the Nei’s genetic distance for all polymorphic SNPs was calculated to construct Neighbor-joining trees under PowerMarker version 3.25 [28]. Principal component analysis (PCA) was done using EIGENSTRAT [29] based on 5,578 SNPs and 2,587 SNPs with minor allele frequency (MAF) >20% and physical distance >60 kb for panels P1 and P2, respectively. Kinship matrices were calculated using TASSEL 4.0 [30] to determine relatedness among individuals based on the same sets of SNPs for the two panels [see Additional file 4]. Linkage disequilibrium parameter (r2) for estimating the degree of LD between pair-wise SNPs (30,345 SNPs for panel P1 and 4,297 SNPs for panel P2 with MAF ≥5%) was calculated using the software TASSEL 4.0 [30] with 1,000 permutations. The LD decay rate was measured as the chromosomal distance at which the average pairwise correlation coefficient (r2) dropped to half its maximum value.

Genome-wide association analysis

Two different models were used to test associations between the SNPs (MAF >5%) and disease assessment criteria. The first model was a simple model where a general linear model (GLM), containing only the SNP tested as a fixed effect, was used to test the association between the SNP and the disease assessment criteria. The second model is a mixed linear model (MLM) where, in addition to the SNP being tested, PCA matrix and relative kinship matrix were included as fixed and random effects, respectively. The GLM and MLM can be expressed as y =  + e and y =  + Pβ + Kμ + e, respectively, where y is the vector of phenotypic observations, α is the vector of SNP effects, β is the vector of population structure effects, μ is the vector of kinship background effects, e is the vector of residual effects, P is the PCA matrix relating y to β, X and K are incidence matrices of 1 s and 0 s relating y to α and μ, respectively [31]. Top six principal components were used to build up the P matrix for population structure correction in the two panels. Analyses were performed by the software TASSEL 4.0 which implemented the EMMA and P3D algorithms to reduce computing time [32]. False discovery rate (FDR) ≤ 0.05 was used to identify significant associations. In order to conduct conditional analyses to test for residual adjacent associations after accounting for a key SNP within the same chromosome, the key SNP was transformed to a numeric value and then added into the MLM as a covariate. A P-value threshold of 10-4, corresponding to an adjustment for 500 independent tests across the region examined, was used to declare statistical significance at secondary signals.

Results and discussion

Genetic diversity and phenotypic variation

Two independent association panels (P1 and P2) were genotyped using Illumina BeadChip containing 52,041 and 5,361 SNPs, respectively. SNPs with MAF of <0.05 and call rate <80% were excluded from further analyses to avoid problems of spurious LD. Final sets of 30,345 and 4,297 high-performing SNPs were used for all analyses. Among these SNPs, samples had an average call rate of >96.5% and between technical replicates yielded >99% pairwise concordances. From these SNPs, we observed an average nucleotide diversity (polymorphism information content or PIC) of 0.281 and 0.284 in panels P1 and P2, respectively. Compared with a previous study [33], these estimates showed that the overall genetic variation of the elite cultivars we studied represents about 80% diversity of soybean landraces. Less than 1.6% of heterozygous genotypes were observed in both panels, which is consistent with the highly inbred nature of cultivated soybean (Table 1). An examination of allele frequency distributions at polymorphic SNPs showed that both panels contained a large number of SNPs with a minor allele frequency (MAF) of <0.1 [see Additional file 5], reflecting the broad genetic diversity in the two association panels.

Table 1 Characteristics of SNPs tested in two association panels

In both association panels, we observed abundant phenotypic variation in SDS resistance measured by disease incidence (DI), disease severity (DS) and calculated disease index (DX, see Additional file 6). The mean DX distribution ranged from 0 to 96.3 in panel P1 and 0 to 82.0 in panel P2. The broad-sense heritability of DX was higher within two environments (two years in one location) in panel P1 (83%) than that across multiple environments (two years across 14 locations) in panel P2 (average 65%) [see Additional file 6].

Patterns of linkage disequilibrium

To characterize the mapping resolution for genome scans and GWA mapping, we quantified the average extent of genome-wide LD decay distance in panel P1 and P2. These estimates were approximately 270 kb and 460 kb, respectively, where the r2 drops to half its maximum value (0.24 and 0.19, respectively). Given that our average inter-marker distance (density) is 35 kb and 241 kb for the panels P1 and the P2 respectively, we expect to have reasonable power to identify common large effect variants associated with SDS resistance in both association panels. Overall LD decay distance in panel P1 was smaller than that in the panel P2 (Figure 1). LD decay distance in panel P1 was also smaller than previously published values in soybean [33, 34]. This difference may be attributed to smaller sample size and lower genome coverage of markers in P2 and previous studies. Since panel P1 had larger sample size and was genotyped with more SNP markers than panel P2, estimation of LD in panel P1 is more reliable. Linkage disequilibrium decay distance varies over different chromosomes, with 410 kb in chromosome 19, 100–200 kb in chromosomes 3, 4, 9, 10, 11 and 16, and 200–300 kb in the remaining 13 chromosomes in panel P1. These LD decay estimates are slightly higher than that in rice (75-150 kb) [14], but much greater than in maize (1.5-10 kb) [35]. This result is consistent with earlier estimate that LD extends to a much longer distance in self-pollinated species than in cross-pollinated species [12].

Figure 1
figure 1

Genome-wide average LD decay estimated in association panels P1 and P2. Decay of LD (measured as genotypic r2) as a function of distance between SNPs.

Population structure

According to the NJ tree analysis as well as PCA, association panel P1 had 4 genetic subgroups (Figure 2a and b), whereas panel P2 had 6 subgroups (Figure 2c and d). It has previously been suggested that the photoperiod response between different maturity groups may be the primary factor driving differentiation of cultivated soybean [36]. A Chi-square test was used to test whether the SNP-data-based clustering (NJ tree) is associated with maturity-group-based grouping in panel P1 and P2. The results showed very significant association (P < 0.0001) between the two grouping factors. Thus the photoperiod response might have driven genetic differentiation among the tested accessions in both panels [see Additional file 7]. The measure of population differentiation, FST, was estimated at 0.168 among the four subgroups of panel P1, suggesting a moderate level of differentiation within panel P1 [see Additional file 8]. The population differentiation of 6 subgroups within panel P2 was slightly less (FST = 0.135) but still similar to that between different soybean landraces (FST = 0.130) [33].

Figure 2
figure 2

Population structures of soybean cultivars in association panels P1 and P2. (a) Neighbor-joining tree of 392 accessions in panel P1. The four subgroups identified from the tree are color-coded in a and b. (b) PCA plots of the first two components of 392 accessions in panel P1. (c) Neighbor-joining tree of 300 accessions in panel P2. The six subgroups identified from the tree are color-coded in c and d. (b) PCA plots of the first two components of 300 accessions in panel P2.

GWA mapping for SDS resistance

Using the GWA strategy to dissect genetic architecture of SDS resistance in the two soybean association panels (P1 and P2), we successfully identified both known associations (candidate genes or QTLs previously reported in soybean), as well as new candidate loci in the soybean genome. The results of significant SNPs discovered in both association panels are summarized in Additional file 9, 10, 11, 12, 13, 14, and Figure 3. As shown in the quantile-quantile (QQ) plots (Figure 3b and f, Additional file 13b and f, and Additional file 14b and f), the distribution of observed -log10 P-values from the simple model, which did not include population structure (Q) and familial relatedness (K), departed from the expected distribution under a model of no association with significant inflation of nominal P-values. While the MLM method, which includes Q and K, allowed us to reduce the excess low P-values for DS, DI and DX (Figure 3d and h, Additional file 13d and h, and Additional file 14d and h). In both association panels, lower inflation of nominal P-values was consistently observed when the MLM method was used than when the simple model was used. Therefore, only the results from the analysis with the MLM model are presented below.

Figure 3
figure 3

Genome-wide association study of SDS in the two association panels. (a) Manhattan plots of the simple model for DX in the association panel P1. The - log10 P-values from a genome-wide scan are plotted against the position on each of the 20 chromosomes. The horizontal red line indicates the genome-wide significance threshold (FDR < 0.05). (b) Quantile-quantile (QQ) plot of simple model for DX in the association panel P1. (c) Manhattan plots of MLM for DX in association panel P1, as in a. (d) Quantile-quantile plot of MLM for DX in association panel P1. (e) Manhattan plots of the simple model for DX in association panel P2, as in a. (f) Quantile-quantile plot of simple model for DX in the association panel P2. (g) Manhattan plots of MLM for DX in association panel P2, as in a. (h) Quantile-quantile plot of MLM for DX in association panel P2.

Of the 52,041 SNPs evaluated in association panel P1, 30, 48 and 56 SNPs were significant, with FDR ≤ 0.05 for DS, DI and DX, respectively (Additional file 9, 10, 11, Figure 3a and c, Additional file 13a and c, and Additional file 14a and c). From the 5,361 SNPs evaluated in association panel P2, 6, 8 and 9 SNPs were significant, with FDR ≤ 0.05 for DS, DI and DX, respectively (Figure 3e and g, Additional file 12, Additional file 13e and g and Additional file 14e and g). To select major QTLs among all the significant SNPs, these SNPs were clumped by using LD block as a criterion [37], and the strongest association within each LD block was kept. After the clumping of SNPs, 20 QTLs for SDS resistance were identified and peak SNPs (strongest associations) are listed in Table 2. The peak SNPs at the identified loci explained approximately 54.5% of the phenotypic variance on average (ranging from 35.7% to 75.4% for different disease assessment criteria, Figure 4). A major QTL on chromosome 18 was found in both association panels (Figure 3).

Table 2 A subset of SNPs significantly associated with SDS resistance and the adjacent candidate genes
Figure 4
figure 4

Contributions of identified loci to phenotypic variance of DS, DI and DX. Numbers of loci used to estimate contributions to phenotypic variance are indicated at ends of bars.

QTL confirmation and candidate genes

We compared the positions of the significant SNPs identified in this study with the positions of the QTLs reported in bi-parental mapping studies and found considerable overlap between these SNPs and the reported genes or QTLs for SDS resistance. Of the 20 loci we detected in the two association panels, seven overlapped with previously identified QTLs (Table 2).

Notably, one of the overlaps is the QTL Rfs2/Rhg1 on chromosome 18. This locus consistently contributes more effective coinheritance of resistance to SDS and reduces infestation by SCN [10]. Previous fine map development did not resolve Rfs2 from Rhg1, suggesting that the underlying genes were either very closely linked or pleiotropic [11]. In this study, we did detect a cluster of associations spanning a physical region of 1.2 Mb (1.2-2.4 Mb) around three Rhg1 genes that were found to contribute to SCN resistance (Figure 5a, Additional file 9, 10, 11, 12) [51]. The cluster of associations also explained a major part of phenotypic variation of SDS resistance in both panels (Additional file 9, 10, 11, 12). If the three Rhg1 genes were pleiotropic, the peak SNP for SDS resistance should be located either within or in the same LD block with the three Rhg1 genes. However, the peak SNP (GM18-1709751) was not only located outside of, but also belonged to a different LD block than the three Rhg1 genes (Additional file 15). One possible explanation for this is that SDS resistance mediated by Rhg1 is also conferred by copy number variation (CNV) that increases the expression of a set of dissimilar genes. Alternatively, there exist other gene/s mediating SDS resistance that are closely linked with Rhg1. In fact, we found that the peak SNP (GM18-1709751) was located at approximately 2.2 kb upstream of GmRLK18-1, a gene that encodes a receptor-like kinase, and its resistance allele is sufficient to confer nearly complete resistance to both root and leaf symptoms of SDS [11]. Moreover, there was another significant SNP (GM18-1712832 with P-value of 1.2 × 10-8 for DX) located within an exon of GmRLK18-1. These results support previous studies with regard to the key role of GmRLK18-1. However, we cannot exclude the possibility that structural variation in the form of CNV may have functional importance and thus contribute to SDS resistance that is not captured by our SNPs.

Figure 5
figure 5

Regional plots showing association mapping results for SNPs located around Rfs2 / Rhg1 on chromosome 18. Before (a) and after (b) controlling for the effects of peak SNP, negative log10-transformed P-values from the MLM are plotted on the left vertical axis for panel P1; Negative log10-transformed P-values from the MLM are plotted on the right vertical axis for association panel P2. Blue horizontal dashed lines indicate the genome-wide significance threshold in P1 association panel. Previously identified genes controlling the SDS resistance are labeled.

We searched for additional independently associated SNP variants near the Rfs2/Rhg1 locus by conditioning on the peak SNP (GM18-1709751) at the Rfs2/Rhg1 loci. At 519 kb downstream of Rfs2/Rhg1, we found an independently associated SNP variant, Gm18-2228646 with P-value of 6.5 × 10-5 in the conditional model, as measured by DS. This SNP is located at 44 bp downstream of a gene encoding an aquaporin transporter (Figure 5b). Therefore, our results provide evidence for the presence of a regulating gene other than GmRLK18-1 that is associated with SDS resistance on chromosome 18.

Besides the Rfs2/Rhg1 region, we refined the mapping location with other significant SNPs within or adjacent to previously reported QTLs (Table 2). Notably, we repeatedly detected a cluster of associations, measured by DS, DI and DX, spanning a physical region of 0.7 Mb (36.5 to 37.2 Mb) near the Rzd locus (resistance to zygote death) on chromosome 7 in panel P1. The Rzd locus contributed to SCN resistance and was strictly co-inherited in phase with Rfs2/Rhg1 in an earlier study [57]. However, not all of the QTLs detected in previous bi-parental populations were detected in our association panels. The reason for failure to detect them may be that root infection severity is not included in our disease assessment criteria, so those QTLs associated with resistance to root infection cannot be identified. Alternatively, some QTLs may segregate at low frequency or not at all in our association panels, or the SNP coverage in this study is still insufficient to capture all of the haplotypes present in the diverse soybean varieties. On the other hand, we found 13 novel QTLs. Compared with the 7 loci within intervals of known QTLs, the 13 new loci are slightly weaker in terms of average P-value (6.47 × 10-5 vs 7.54 × 10-4) and explained phenotypic variance (8.14% vs 7.06%). However, some of them explain as much as, or even more phenotypic variance than that of known QTLs (Table 2). We checked whether these new QTLs were near loci for determinacy, maturity date, leaf, stem or root morphology and found that one new QTL at 33.6 Mb on chromosome 13 was within the interval containing QTLs for plant height and stem strength. Another new QTL at 34.8 Mb on chromosome 19 was located approximately 2 Mb downstream of a locus related with flowering time and leaf morphology. No new QTLs were found near loci related to determinacy, maturity date, or root morphology. To further validate these new loci, we developed five recombinant inbred line (RIL) populations and are currently conducting a confirmation study. To date, we have conducted conventional QTL mapping in three of the five RIL populations. Five out of the 13 novel QTLs have been validated in the three RIL populations (data not shown). Undoubtedly, the major loci identified in this study can be used to improve soybean for SDS resistance.

When we checked candidate genes containing or immediately adjacent to the significant SNPs, we found that diverse types of genes are probably involved in natural variation for soybean SDS resistance (Table 2). For instance, we identified one pentatricopeptide repeat (PPR) gene, which has certain features in common with disease resistance genes (R genes) [38]. We also identified two genes encoding leucine-rich repeat (LRR) domains, which are important in plant responses to a variety of external stimuli including pathogens (Table 2 and ref. 44). A gene with similarity to ubiquitin-like protein, which is required for host and nonhost disease resistance in plants [45], was also identified. Several other SNPs were within or adjacent to sequences annotated as homologous to genes previously shown to be involved in plant disease resistance (Table 2). Follow-up studies will focus on validating effects of these genes, uncovering the molecular mechanisms of complex SDS resistance in soybean and integrating this knowledge to dissect mechanisms underlying quantitative resistance to soil-borne pathogens.

Conclusions

In this study, GWA mapping with correction for population structure and cryptic relatedness identified multiple novel loci and refined the map locations of known loci related to SDS resistance in soybean. This information not only demonstrates that GWA mapping can be used as a powerful tool for dissecting disease resistance mechanisms in soybean, but also provides valuable markers for developing soybean cultivars with durable resistance against SDS. Moreover, the candidate genes containing these SNP loci represent promising targets for further efforts to pinpoint causal variants and to clarify how the implicated genes affect SDS resistance in soybean.

Availability of supporting data

The data sets supporting the results of this article are included within this article and its additional files.