Genome-wide association study, genomic prediction and marker-assisted selection for seed weight in soybean (Glycinemax)
- 4.5k Downloads
Twenty-two loci for soybean SW and candidate genes conditioning seed development were identified; and prediction accuracies of GS and MAS were estimated through cross-validation and validation with unrelated populations.
Soybean (Glycine max) is a major crop for plant protein and oil production, and seed weight (SW) is important for yield and quality in food/vegetable uses of soybean. However, our knowledge of genes controlling SW remains limited. To better understand the molecular mechanism underlying the trait and explore marker-based breeding approaches, we conducted a genome-wide association study in a population of 309 soybean germplasm accessions using 31,045 single nucleotide polymorphisms (SNPs), and estimated the prediction accuracy of genomic selection (GS) and marker-assisted selection (MAS) for SW. Twenty-two loci of minor effect associated with SW were identified, including hotspots on Gm04 and Gm19. The mixed model containing these loci explained 83.4 % of phenotypic variation. Candidate genes with Arabidopsis orthologs conditioning SW were also proposed. The prediction accuracies of GS and MAS by cross-validation were 0.75–0.87 and 0.62–0.75, respectively, depending on the number of SNPs used and the size of training population. GS also outperformed MAS when the validation was performed using unrelated panels across a wide range of maturities, with an average prediction accuracy of 0.74 versus 0.53. This study convincingly demonstrated that soybean SW is controlled by numerous minor-effect loci. It greatly enhances our understanding of the genetic basis of SW in soybean and facilitates the identification of genes controlling the trait. It also suggests that GS holds promise for accelerating soybean breeding progress. The results are helpful for genetic improvement and genomic prediction of yield in soybean.
Soybean (Glycine max), rich in both protein and oil, is one of the most economically important crops. It accounted for approximately 68 % of world protein meal consumption and 57 % of world oilseed production during the past decades (USDA-FAS, http://www.fas.usda.gov/). As a yield component, seed weight (SW) is a complex and agronomically important trait in soybean. It is considerably attributed to seed size, which is an important character of soybean cultivars and affects the quality of many soy products for human consumption, such as soy sprouts, soy nuts, edamame, soy sauce, natto and miso (Clarke and Wiseman 2000; Friedman and Brandon 2001). Therefore, dissecting the genetic basis of SW is helpful to improve soybean yield potential and soy food as well.
Seeds of flowering plants consist of three major compartments: embryo, endosperm and seed coat. Embryo and endosperm are two zygotic products of double fertilization with different genotypes. The embryo, that gives rise to the daughter plant, is diploid with a copy of genome from each parent. Whereas the endosperm, that provides nutrition for embryogenesis and seed germination, is triploid with one and two genome equivalents from paternal and maternal parent, respectively (Lopes and Larkins 1993). By contrast, the seed coat that encloses the embryo and endosperm is strictly of maternal origin and differentiated from maternal-derived integuments (Haughn and Chaudhury 2005). It can be further divided into an inner and outer coat known as the tegmen and testa, respectively. The final seed size and weight is determined by the coordination among the embryo, endosperm and seed coat. In dicots like soybean, however, the endosperm is absorbed by the cotyledon, a part of embryo.
Research in Arabidopsis suggested that the cell division and elongation of embryo, endosperm and integument, which are predominantly regulated by phytohormones, determine the final size of seeds (Jiang et al. 2013; Ohto et al. 2005; Riefler et al. 2006; Schruff et al. 2006; Singh et al. 2002). The Arabidopsis AUXIN RESPONSE FACTOR 2 (ARF2), one of the ARF transcription factor family members that bind to auxin-responsive elements in the promoters of auxin-regulated genes, plays a central role in various auxin-mediated developmental processes including seed growing (Okushima et al. 2005). The loss-of-function mutant arf2 resulted in an increased seed size and weight due to extra cell division in the integuments (Schruff et al. 2006). The Arabidopsis histidine kinase 2 (AHK2), AHK3 and AHK4 (also known as cytokinin receptors) fulfill important roles in cytokinin-dependent endospermal and maternal control of embryo size. Triple mutant ahk2,3,4 showed increased cell number and cell size of the embryo, resulting in enlarged seed size (Riefler et al. 2006). The transcriptional factor Arabidopsis APETALA2 (AP2), an ethylene responsive element binding protein and functioning in the floral and seed development (Jofuku et al. 1994), has broad effects on the development of embryo, endosperm and seed coat. Large seed size was observed in loss-of-function mutant ap2, which exhibited elongated integument cells, enlarged embryo and endosperm (Ohto et al. 2005).
In soybean, research has shown that seed size and weight are controlled by both environmental and genetic factors. The temperature and water availability during seed filling and the duration of seed filling substantially affect seed size and weight (Dornbos Jr and Mullen 1991; Egli et al. 1978). Genetic studies have identified many quantitative trait loci (QTLs) associated with seed weight or related traits in soybean (SoyBase, http://www.soybase.org/). However, because of the limited recombination or insufficient maker density, the previously identified chromosomal regions associated with SW were usually not fine enough to access candidate gene(s) (Niu et al. 2013; Salvi and Tuberosa 2005). This also hinders breeding efforts to improve seed weight and size in soybean through marker-assisted selection (MAS). Seed size in soybean is also believed to be relevant to domestication (Liu et al. 2007). The cultivated soybeans have larger seed size and weight than their wild relatives (G. soja). A recent study indicated that the GmCYP78A10 gene, strongly associated with seed size in soybean, underwent artificial selection during early stage of soybean domestication (Wang et al. 2014). However, the molecular mechanism underlying SW and seed size in soybean remains unclear although significant advance has been achieved.
With the decreased genotyping cost and improved statistical methods, genome-wide association study (GWAS) and genomic selection (GS) present promising prospects for genetic improvement of complex traits in crop species. GWAS with a population of unrelated lines and high-density single nucleotide polymorphism (SNP) markers is capable of identifying causal genes for a broad range of complex traits in different crops (Huang et al. 2010; Li et al. 2013; Morris et al. 2013). In soybean, it has been used to disclose the genetic architecture of agronomic traits (Zhang et al. 2015), seed composition (Hwang et al. 2014) and disease resistance (Wen et al. 2014). GS refers to marker-based selection by capturing the total genetic variance with genome-wide markers without identifying a subset of trait-associated markers (Meuwissen et al. 2001). In general, GS is expected to be more effective than MAS that based on a few loci for quantitatively inherited traits (Bernardo and Yu 2007). A large number of GS studies have been reported in crop species such as maize (Albrecht et al. 2011; Bernardo 1996; Piepho 2009; Technow et al. 2013) and wheat (Heffner et al. 2011; Poland et al. 2012; Rutkoski et al. 2011) for various agronomic traits and disease resistance. However, its application to soybean is rarely addressed (Bao et al. 2014; Jarquin et al. 2014), and validation with unrelated population has not been reported.
Therefore, the objectives of this study were to (1) better understand the genetic architecture underlying SW in soybean, and (2) explore the potential of marker-based prediction as a new approach in soybean breeding. Twenty-two loci associated with SW and putative candidate genes with Arabidopsis orthologs involved in seed mass determination were identified via GWAS. GS exhibited higher prediction accuracies than MAS in all the tests of both cross-validation and validation with unrelated panels obtained from GRIN (http://www.ars-grin.gov/). This study enhances our understanding of the genetic architecture of SW and expedites the identification of genes conditioning SW in soybean. It demonstrates that GS can increase breeding efficiency and is also useful for genomic prediction of yield in soybean.
Materials and methods
Plant materials and field trials
Three hundred and nine plant introductions (PIs), obtained from the USDA Soybean Germplasm Collection, were planted in single-row plots in a randomized complete block design with three replications at four environments: Aurora (44°18′N and 96°40′W, 2011), Brookings (44°27′N and 96°47′W, 2010 and 2012) and Watertown (45°06′N and 97°05′W, 2012) in South Dakota, USA. The plots were 3.05 m in length and 0.76 m in width (or row spacing), and 86 seeds were sown per plot/row. According to the GRIN (http://www.ars-grin.gov/), 92 % of the PIs are maturity group (MG) 0 and 8 % MG 00, and 91 % originated from Northern China. The detailed information about the 309 PIs was given in our previous publication (Zhang et al. 2015).
Phenotypic evaluation and statistical analysis
All plots were bulk-harvested individually after full maturity (R8 stage), and the seeds were dried in an air dryer. A sample of 100 cleaned seeds from each plot was randomly taken and weighed, and the data were presented as grams per 100-seed. The model for the phenotypic trait was yijk = µ + gi + lj + (gl)ij + bk(j) + eijk, where µ is the overall mean, gi is the genetic effect of the ith genotype, lj is the effect of the jth environment, (gl)ij is the interaction effect between the ith genotype and the jth environment, bk(j) is the block effect within the jth environment, and eijk is a random error following N(0, σe2). Broad-sense heritability was calculated on an entry-mean basis as H2 = σg2/[σg2 + σgl2/k + σe2/(rk)], where σg2 is the genotypic variance, σgl2 is the genotype by environment interaction variance, k is the number of environments, r is the number of replications. Estimation of variance components was performed by the varcomp procedure in SAS version 9.3 (SAS Institute, Inc., Cary, NC), with all effects considered to be random. The likelihood-ratio-based R2 was calculated to estimate the proportion of total variation explained by the mixed linear model (MLM) containing all identified loci for trait as described by Sun et al. (2010).
Genotyping and quality control
The Illumina Infinium SoySNP50K BeadChip was used to genotype the population as described by Song et al. (2013), and 42,509 SNPs were identified with a call success rate of 85 % or greater. Of them, 61 SNPs present in unanchored sequence scaffolds were excluded from further analysis. The dataset had a missing rate of 0.6 %, and the missing data were imputed using BEAGLE version 3.3.1 with default parameter settings (Browning and Browning 2007, 2009). SNPs with a MAF < 5 % in the population were excluded from further analysis as well. Finally, a total of 31,045 SNPs were used for GWAS. The genotypic data of these SNPs for the four GRIN panels were downloaded from the SoyBase (http://www.soybase.org/snps/index.php), and the imputation analysis was performed same as described above for the main association panel.
Linkage disequilibrium estimation
Pairwise LD between markers was calculated as squared correlation coefficient (r2) of alleles using R package synbreed (Wimmer et al. 2012). Due to the substantial difference in recombination rate between euchromatic and heterochromatic regions, r2 was calculated separately for the two chromosomal regions. The physical length of euchromatic and heterochromatic regions for each chromosome was defined as in the Gmax1.01 reference genome (SoyBase, www.soybase.org). Only r2 for SNPs with pairwise distance less than 10 Mb in either euchromatic or heterochromatic regions was used to draw the average LD decay figure by R script using the equation described by Remington et al. (2001). The LD decay rate of the population was measured as the chromosomal distance where the average r2 dropped to half its maximum value (Huang et al. 2010).
Genome-wide association analysis
To minimize the effects of environmental variation, best linear unbiased predictions (BLUPs) of genetic effect for each line were calculated using the R package lme4 (Bates et al. 2012) in the same model as described for phenotypic trait. The BLUPs were then used to fit various models for association analysis. The one-way ANOVA model for naive test, without correction of population structure and familial relatedness, was implemented in R (Team 2012). General linear model (GLM) with population structure and MLM accounting for both population structure and kinship were implemented in the Genomic Association and Prediction Integrated Tool (GAPIT) R package (Lipka et al. 2012; Zhang et al. 2010).
Genomic prediction and marker-assisted selection
Both cross-validation and validation with unrelated panels were conducted for GS and MAS. A modified cross-validation was performed as described by Wurschum et al. (2013) to estimate the prediction accuracy. Briefly, for a fivefold cross-validation, 62 PIs (20 %) of the association panel were randomly assigned to a validation set for each prediction, and all of the remaining PIs (247, 80 % of the association panel) were used as the training set. For the training population size-effect analysis, a subgroup (97, 147 or 197) of the remains was randomly selected as the training set. The loci effects were then estimated based on genotypic and phenotypic data of the training set. Finally, the estimates of loci effect were used to predict the genomic estimated breeding values (GEBVs) of the validation set based on the genotypic data. The whole process was repeated for 1000 times. Ridge regression best linear unbiased prediction (RR- BLUP) was used to predict GEBVs in GS, as it has been demonstrated an effective prediction method with high accuracy across a wide range of traits and crops (Heslot et al. 2012; Jarquin et al. 2014; Lipka et al. 2014). While in MAS, multiple linear regression (MLR) was employed. The prediction accuracy was calculated by dividing the average Pearson’s correlation (r) between the BLUPs of genetic effects and the GEBVs with the square root of the heritability (Resende et al. 2012). To investigate the prediction accuracies with different number of markers, three sets of markers (2000, 1000 and 500 SNPs in total, respectively) were formed by randomly selecting 100, 50 or 25 SNPs from each of the 20 chromosomes. Each set of markers was then used in the cross-validation as described above.
Both GS and MAS were validated with four unrelated data sets taken from GRIN (http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?51015). They were SOYBEAN.EVALUATION.1MN63 (MN63), SOYBEAN.EVALUATION.1IL64 (IL64), SOYBEAN.EVALUATION.3IL83.2 (IL83.2) and SOYBEAN.EVALUATION.MS989 (MS989) (Supplementary file Table S1). Lines presented in the association panel were excluded from analyses, and thus the final population size was 270, 724, 192 and 425 for MN63, IL64, IL83.2 and MS989, respectively. The whole population of each panel was used for validation based on the marker effects estimated from the association panel. The prediction accuracies for these validations were simply estimated as the Pearson’s correlation (r), as no heritability was available for adjustment.
Statistics of phenotypes
The averaged SW over four environments showed a continuous distribution in the GWAS panel of 309 soybean PIs, with a wide range of variation from 7.3 to 23.6 g per 100-seed (Supplementary file Fig. S1). The ANOVA indicated that effects of genotypes, environments and their interaction were significant (Supplementary file Table S2). The correlations of trait performance across environments were quite high, averaged r > 0.85 (P < 10−10), indicating high repeatability of the trait performance. The estimate of broad-sense heritability was 0.97, suggesting that the majority of phenotypic variation in soybean SW is attributed to genetic effects.
Distribution of markers and linkage disequilibrium
A total of 31,045 SNPs with minor allele frequency (MAF) ≥0.05 was used to GWAS for SW after quality control, resulting in a genome-wide marker density of 29 kb per SNP. On the SoySNP50K, however, markers were designed to unevenly distribute across chromosomal regions due to the substantial difference in recombination ratio between euchromatic and heterochromatic regions (Song et al. 2013). In the present study, 74.6 % of SNPs were located in euchromatic regions, exhibiting a marker density of 20 kb per SNP in euchromatic region but 62 kb per SNP in heterochromatic region. Accordingly, the linkage disequilibrium (LD) decayed (r2 drop to half of its maximum value) at 326 kb in euchromatic region, while in heterochromatic region LD did not decay until 4285 kb (Supplementary file Fig. S2).
Genome-wide association analysis of seed weight
The MLM, taking both population structure and relative kinship into account (Yu et al. 2006; Zhang et al. 2010), was employed to conduct the association analysis. Principal component analysis was performed with the whole set SNPs to capture the overall population stratification of the association panel. The first four PCs that explained 28.2 % of the total genetic variation were involved in the MLM according to the BIC test (Supplementary file Table S3). The first three PCs were presented in the supplementary file Fig. S3. Compared with the GLM, which involves population structure only, and the naive model, which involves neither population structure nor individual relationship, the MLM showed a greater control of genomic inflation (type I error) (Supplementary file Fig. S4). We also conducted GWAS with a compressed MLM. However, similar results were found between the regular and the compressed MLM (Supplementary file Fig. S4). Therefore, all further analysis and results presented were referred to GWAS using the regular MLM.
Loci and SNPs significantly associated with seed weight, predicted candidate genes and previously reported QTLs for seed weight at similar genome regions
Seed weight 5-2
Seed weight 20-2, 36-15, seed length 1-14, seed height 1-13 and seed volume 1-10
Seed weight 2-2
Seed weight 2-2
Seed weight 34-13 and 35-1
Seed weight 2-7 and 27-3
Seed weight 27-2, 34-8, 25-4 and 37-6
Transcription factor MYB61-like
Seed weight 36-14
Seed weight 36-14
ARM repeat superfamily protein
Seed weight 36-14
Nucleotide-diphospho-sugar transferase family protein
Seed weight 29-2
Seed weight 35-7, 15-7
Seed weight 35-7, 15-7
Seed weight 5-1, 15-7, 17-1, 34-7, 35-7, 36-7. Seed volume 1-7, 1-8, seed length 1-10 and seed height 1-10,1-11, seed width 1-8
AUX/IAA family protein
Seed weight 34-5
Loci effects and prediction of candidate genes
Genomic prediction and MAS for seed weight
Lead SNPs of loci selected by stepwise method for MAS based on Akaike information criterion
No. of SNPs
Selected SNPs for genetic breeding values prediction
Gm04_14213918, Gm14_5352488, Gm18_59490788, Gm19_41013395, Gm20_481573
Gm04_14213918, Gm06_13151347, Gm09_40164588, Gm10_37088544, Gm14_5352488, Gm18_59490788, Gm19_41013395, Gm19_41144271, Gm19_42921997, Gm20_481573
Gm04_14213918, Gm04_47414790, Gm06_13151347, Gm06_15115808, Gm07_15662403, Gm09_40164588, Gm10_37088544, Gm14_5352488, Gm14_6324126, Gm18_59490788, Gm18_61540919, Gm19_41013395, Gm19_41144271, Gm19_42921997, Gm20_481573
Allelic segregation of the 15 selected loci in the five plant introductions (PIs) with extreme seed weight (SW) in the association panela
We further investigated the efficiency of GS with the entire set of SNPs and MAS with the 15 selected trait-associated SNPs, with which the highest prediction accuracy was realized in MAS as described above, in predicting SW using four unrelated populations obtained from GRIN (http://www.ars-grin.gov/). These panels represent a wide range of maturities (MG 000-VIII) and a large variation for SW (Supplementary file Table S1 and Fig. S5). An approximately normal distribution was observed for SW in the panels except MS989 with a later maturity (Supplementary file Fig. S5). The prediction accuracies of GS ranged from 0.63 for MS989 to 0.83 for 3IL83.2, 21 % averagely higher than that of MAS, which ranged from 0.42 for MS989 to 0.58 for 1MN63 (Fig. 4c).
In association study, relative kinship and population structure are two major confounding factors that may lead to spurious results (Yu et al. 2006). The MLM taking both familial relatedness and population structure into account has been demonstrated as an effective way to control genomic inflation and has been widely used in GWAS for various complex traits and plant species (Huang et al. 2010; Li et al. 2013; Morris et al. 2013). Our previous study also showed that a marker-based principal analysis is sufficiently flexible to generate trait-specific population structure that optimizes the model fitness (Zhang et al. 2015). In the present study, the first four PCs as suggested by BIC test were used to rule out spurious association due to population structure during association analyses (Supplementary file Table S3). We also compared different models and found that genomic inflation was attributed to both population structure and relative kinship in this study (Supplementary file Fig. S4). The MLMs (regular MLM and compressed MLM) showed a better control of genomic inflation than the naive model and the GLM with population structure, while the performances of the regular MLM and compressed MLM were similar (Supplementary file Fig. S4). It indicates that the feature of a trait should be taken into account for considering the optimal model, though compressed MLM was more powerful and effective in some association studies (Huang et al. 2010; Zhang et al. 2010).
In soybean, more than 200 QTLs for SW have been reported across 20 chromosomes (SoyBase, http://www.soybase.org/). Additionally, many QTLs associated with seed size, a trait highly related to SW, have also been identified. These QTLs could be used to confirm the loci identified via GWAS, especially the hot ones. In this study, 22 loci associated with SW were identified. Each individual locus could explain a small proportion (<4 %) of phenotypic variance. The result demonstrated that soybean seed weight is a typical quantitative trait, genetically conditioned by many minor-effect loci. Of the identified loci, 15 have been previously reported at least once (Table 1). Among these loci, the SW11 locus at 37.1 Mb position on Gm10 exhibited the strongest association. In the similar region, four SW QTLs have been reported previously. Another locus SW2 at the 14.2 Mb position on Gm04 has been reported to be associated with SW and seed size related traits, e.g. seed length, height and width, two and three times, respectively (Table 1). The SW21 on Gm19 was mapped within an overlapped region of six SW QTLs and six QTLs for seed size related traits reported previously, indicating that SW21 might be in one of the hottest regions related to SW in soybean (Table 1). The high repeatability of these loci across various environments and genetic backgrounds implies a great potential of marker-based breeding for SW in soybean. As SW plays an important role in soybean yielding, these loci are also useful for soybean yield improvement.
One of the primary advantages of GWAS is the high mapping resolution as compared with linkage mapping. This feature enables GWAS to further narrow down the chromosomal region of candidate QTLs and predict causal genes. In this study, we were able to map SW11 and SW22 to a chromosomal region of 218 kb on Gm10 and 65 kb on Gm20, respectively. Within the regions of SW11 and SW22, there were 13 and six putative genes located, respectively (Fig. 4). According to the “Seed development transcript count” track at the SoyBase website (http://soybase.org/), we further decreased the number of candidate genes to eight for SW11 and five for SW22, respectively. The closest seed transcript to the lead SNP of the SW11 locus, Glyma10g28250, was predicted to encode a MYB61-like transcription factor. In Arabidopsis, MYB61 is required for mucilage deposition and extrusion during imbibition of seed coat (Penfield et al. 2001). In mature seed, mucilage is present in a dehydrate form within epidermal testa cells, and is also important for seed germination and dormancy. Deficiency of mucilage was observed in some Arabidopsis mutants with phenotype in seed size and/or weight such as ap2 and transparenttestaglabra (ttg) (Debeaujon et al. 2000).
In addition, some candidates of interest for other loci were also identified. The candidate gene for SW19 locus, Glyma19g33421, annotated as an AHP family member (Table 1; Fig. 2b). AHPs play a key role in the two-component pathway by which cytokinin signaling is conducted. In the two-component system, AHPs transmit cytokinin signal from the cytokinin receptors AHKs (also known as histidine kinase) to downstream Arabidopsis response regulators (ARRs) as well as cytokinin response factors (CRFs), which include a sub set of AP2 transcriptional factors (Rashotte et al. 2006). Both the triple mutant ahk2,3,4 and the ap2 mutant showed enlarged seed size mainly due to increased size of embryo (Ohto et al. 2005; Riefler et al. 2006). Interestingly, the Arabidopsis penta mutant ahp1,2,3,4,5 has a similar phenotype of enlarged embryo and seed size (http://arabidopsis.org). Therefore, it is conceivable that Glyma19g33421, encoding an AHP protein, might be involved in cytokinin-mediated seed mass regulation pathway in soybean. The SW21 locus was mapped to a region on Gm19 similar to 12 previously reported QTLs for SW or seed size related traits. The candidate gene, Glyma19g35180, putatively encodes an AUX/IAA family protein (Fig. 2c). AUX/IAA family is one of the key regulators of the auxin-modulated genes and is involved in various developmental responses to auxin. It regulates the expression of auxin-induced genes by heterodimerization with ARFs (Reed 2001). Arabidopsis ARF2 is a general repressor of cell division. Loss-of-function mutant arf2 showed extra cell division in the integuments and dramatically increased seed size and weight (Schruff et al. 2006). However, it is uncertain whether Glyma19g35180 is involved in ARF2 mediated SW signaling in soybean. More studies like functional analyses of candidate genes are required to validate their possible roles in determining soybean SW.
GS is based on the genetic effects of dense markers across entire genome (Meuwissen et al. 2001). It hypothetically captures all genetic variations of a trait, while MAS is usually based upon a small number of loci, particularly major-effect loci. In our study, GS consistently outperformed MAS for SW in soybean for various cross-validations within the association panel, even though in the case that the prediction accuracy of MAS might have been overestimated when the same population as for trait-associated loci identification was used. Moreover, the superiority of GS was more obvious when unrelated panels were used in validation (Fig. 4c), indicating that GS is more robust than MAS in predicting soybean SW. In addition to the reason mentioned above that more genetic variants can be captured by GS than by MAS, the higher prediction accuracy of GS can be partially due to its ability to well capture the genetic relationship between training and validation sets, which also contributes to the prediction accuracy (Habier et al. 2007).
Compared with a previous GS study in soybean, the prediction accuracy of GS in the present study was relatively high. Jarquin et al. (2014) conducted GS for grain yield, plant height and maturity date in a population of 301 elite soybean breeding lines with 52,349 SNPs. The highest estimate of broad-sense heritability was 0.94 for maturity date, which is similar to the SW (0.97) in this study. In that study, however, the highest heritability-adjusted prediction accuracy was 0.70 for maturity date with the optimal prediction model, which is lower than the highest prediction accuracy of 0.83 in the present study. Because the two studies had similar population size, number of SNPs and trait heritability, the difference in genetic architecture of traits and populations under study might be the major factors leading to the difference in prediction accuracies between the two studies.
Consistently with previous studies (Asoro et al. 2011; Jarquin et al. 2014), we found that the prediction accuracy increased as the training population size increased in both GS and MAS. However, this tendency appeared in different patterns. In GS, the effect of training population size was relatively stable over predictions with different number of SNPs (Fig. 4a). While in MAS, the effect of population size was magnified with the increase of loci used for prediction (Fig. 4b). It implies that the prediction of MAS is more sensitive to the change of training population sizes, which may lead to changes of relatedness between training and validation sets. Minimizing markers while keeping reasonable prediction accuracy is helpful to reduce genotyping cost in genomic prediction. We found that decreasing the number of SNPs did not necessarily sacrifice the prediction accuracies of GS, suggesting over-abundance of SNPs in the association panel. It can be explained by the extensive LD of the soybean genome and/or that not all SNPs contribute to the genetic variation of SW in soybean. Our results also suggested that a subset of 2000 SNPs (100 SNPs per chromosome) could be capable of providing equivalent prediction accuracy as the whole set of 31,045 SNPs did. Further decreasing the number of SNPs might be practicable, depending on the budge availability and the requirement for prediction accuracy. To the unrelated population, further work is needed to address the marker density effect and the over-abundance of SNPs in soybean GS. In addition, the prediction accuracy of MAS with all the 22 loci identified was lower than that with 15 or 10 loci selected. It suggested that more loci involved in MAS might not necessarily lead to higher efficiency as expected. Increasing the number of loci/markers used in MAS may not be an appropriate option in some cases even if efficiency of selection is the only issue to be considered.
In most of the GS or MAS studies, the training and validation populations are related, and cross-validation is used to estimate the prediction accuracy (Bao et al. 2014; Hoeck et al. 2003; Jarquin et al. 2014). However, their selection efficiency for unrelated lines is usually undetermined. In the present study, besides cross-validation, validation with unrelated populations obtained from GRIN (http://www.ars-grin.gov/) was also carried out. We found that the prediction accuracy of GS varied among the panels with different groups of maturity (Fig. 4c). This phenomenon might be caused by the different degrees of similarity in genetic compositions between the training panel and the validation sets. We noted that the prediction accuracies (0.74–0.83) for the early and medium maturity validation panels (MG 000-IV) were higher than that (0.63) of the late maturity panel (MG V-VIII). The high prediction accuracy of GS across a wide range of maturities may be benefitted from the broad genetic diversity of the association population and the dense SNPs used in the study. The latter helps to maximize the LD between markers and causal genetic variants and also increases the potential of capturing the genetic relatedness between training and test sets (Gowda et al. 2014; Habier et al. 2007). According to the information from USDA (http://www.usda.gov/), soybean cultivars with MG 000-IV accounted for over three-fourths of the U.S. soybean production in 2013. Therefore, GS with the marker effects estimated in this study hold a great potential in predicting soybean SW. Given SW is an important component of grain yield, the results are also useful for the genomic prediction and genetic improvement of soybean yield.
In this study, we identified 22 loci associated with SW via GWAS and thus convincingly demonstrated that soybean SW is conditioned by numerous loci of minor effect. We also refined candidate chromosomal regions for the known QTLs, including the hotspots on Gm04 and Gm19. Candidate genes with Arabidopsis homologs involved in seed development, auxin and cytokinin signalings controlling SW were proposed. Cross-validation showed that MAS could be a competitive alternative of GS when training and validation population were related or had similar genetic components. However, validation with unrelated populations across a broad range of maturities suggested that GS is superior to MAS. This study enhances our knowledge of the genetic basis of SW in soybean, and facilitates exploring the molecular mechanisms underlying the trait. It also benefits the genomic prediction of yield in soybean and suggests that GS would have a great potential in increasing genetic gains of soybean breeding.
Author contribution statement
J.Z. and G.-L.J. designed the research experiments. Q.S. and P.B.C. designed SNPs and performed genotyping. J.Z. performed phenotyping and analyzed data. G.-L.J. designed the overall project. J.Z and G.-L.J. wrote the manuscript.
This research was supported in part by the USDA-NIFA/CSREES Hatch project, the grants of the South Dakota Soybean Research and Promotion Council and the United Soybean Board awarded to G.-L. Jiang. South Dakota Agricultural Experiment Station, and Virginia State University Agricultural Research Station also provided support in completion of the research and manuscript. The authors thank Dr. Randall Nelson, the curator of the USDA-ARS Germplasm Collection, for providing germplasm accessions, Dr. Jixiang Wu, South Dakota State University, for assistance in statistical analyses, and Dr. Richard Joost, the United Soybean Board, for reviewing the manuscript. The authors are also grateful to Xianzhi Wang, Siddhi Bhusal, Mukhtar Agoub, Nicholas Hall, Marci Green and Sivananda Tirumalaraju for assistance in field experiments.
Compliance with ethical standards
Conflict of interests
The authors declare no conflict of interest.
- Bao Y, Vuong T, Meinhardt C, Tiffin P, Denny R, Chen SY, Nguyen HT, Orf JH, Young ND (2014) Potential of association mapping and genomic selection to explore PI 88788 derived soybean cyst nematode resistance. Plant Genome 7(3). doi:10.3835/plantgenome2013.11.0039
- Bates D, Maechler M, Bolker B (2012) lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0. http://CRAN.R-project.org/package=lme4
- Gowda M, Zhao Y, Wuerschum T, Longin CFH, Miedaner T, Ebmeyer E, Schachschneider R, Kazman E, Schacht J, Martinant JP, Mette MF, Reif JC (2014) Relatedness severely impacts accuracy of marker-assisted selection for disease resistance in hybrid wheat. Heredity 112:552–561CrossRefPubMedPubMedCentralGoogle Scholar
- Huang XH, Wei XH, Sang T, Zhao QA, Feng Q, Zhao Y, Li CY, Zhu CR, Lu TT, Zhang ZW, Li M, Fan DL, Guo YL, Wang A, Wang L, Deng LW, Li WJ, Lu YQ, Weng QJ, Liu KY, Huang T, Zhou TY, Jing YF, Li W, Lin Z, Buckler ES, Qian QA, Zhang QF, Li JY, Han B (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42:U961–U976CrossRefGoogle Scholar
- Li H, Peng Z, Yang X, Wang W, Fu J, Wang J, Han Y, Chai Y, Guo T, Yang N, Liu J, Warburton ML, Cheng Y, Hao X, Zhang P, Zhao J, Liu Y, Wang G, Li J, Yan J (2013) Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat Genet 45:43–50CrossRefPubMedGoogle Scholar
- Morris GP, Ramu P, Deshpande SP, Hash CT, Shah T, Upadhyaya HD, Riera-Lizarazu O, Brown PJ, Acharya CB, Mitchell SE, Harriman J, Glaubitz JC, Buckler ES, Kresovich S (2013) Population genomic and genome-wide association studies of agroclimatic traits in sorghum. Proc Natl Acad Sci U S A 110:453–458CrossRefPubMedPubMedCentralGoogle Scholar
- Schmutz J, Cannon SB, Schlueter J, Ma JX, Mitros T, Nelson W, Hyten DL, Song QJ, Thelen JJ, Cheng JL, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu SQ, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du JC, Tian ZX, Zhu LC, Gill N, Joshi T, Libault M, Sethuraman A, Zhang XC, Shinozaki K, Nguyen HT, Wing RA, Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178–183CrossRefPubMedGoogle Scholar
- Team RC (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, AustriaGoogle Scholar
- Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New YorkGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.