Association analysis using SSR markers to find QTL for seed protein content in soybean
- Cite this article as:
- Jun, T., Van, K., Kim, M.Y. et al. Euphytica (2008) 162: 179. doi:10.1007/s10681-007-9491-6
Association analysis studies can be used to test for associations between molecular markers and quantitative trait loci (QTL). In this study, a genome-wide scan was performed using 150 simple sequence repeat (SSR) markers to identify QTL associated with seed protein content in soybean. The initial mapping population consisted of two subpopulations of 48 germplasm accessions each, with high or low protein levels based on data from the USDA’s Germplasm Resources Information Network website. Intrachromosomal LD extended up to 50 cM with r2 > 0.1 and 10 cM with r2 > 0.2 across the accessions. An association map consisting of 150 markers was constructed on the basis of differences in allele frequency distributions between the two subpopulations. Eleven putative QTL were identified on the basis of highly significant markers. Nine of these are in regions where protein QTL have been mapped, but the genomic regions containing Satt431 on LG J and Satt551 on LG M have not been reported in previous linkage mapping studies. Furthermore, these new putative protein QTL do not map near any QTL known to affect maturity. Since biased population structure was known to exist in the original association analysis population, association analyses were also conducted on two similar but independent confirmation populations. Satt431 and Satt551 were also significant in those analyses. These results suggest that our association analysis approach could be a useful alternative to linkage mapping for the identification of unreported regions of the soybean genome containing putative QTL.
Keywords: Association mapping; Glycine max; Linkage disequilibrium (LD); Population structure; Quantitative trait loci (QTL); Seed protein content; Simple sequence repeat (SSR)
Although traditional breeding methods have been successfully used to increase seed protein content in soybean [Glycine max (L.) Merr.] (Hartwig 1990; Leffel 1992; Chung et al. 2003), the development of new cultivars with high protein levels could be facilitated by the use of marker-assisted selection (MAS) for high protein genes. Molecular markers have been used in linkage mapping studies of segregating populations to identify the quantitative trait loci (QTL) associated with seed protein levels (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Orf et al. 1999; Casanadi et al. 2001). A linkage mapping approach to finding QTL requires time to develop mapping populations and it is necessary to evaluate these populations in multiple environments in order to obtain robust phenotypic data. Additional limitations result from the small size of most mapping populations and from the limited opportunities for crossing over to occur during the development of these populations (Hansen et al. 2001; Stella and Boettcher 2004; Gupta et al. 2005). The number of meioses that occur during the development of most mapping populations is small, and the limited recombination makes it difficult to map QTL with much precision (Cardon and Bell 2001).
Association analysis based on linkage disequilibrium (LD) has recently emerged as an alternative approach to mapping QTL and genes associated with some human diseases (Pritchard and Przeworski 2001; Reich et al. 2001; Weiss and Clark 2002). LD is defined as the nonrandom association of alleles at different loci (Flint-Garcia et al. 2003). In addition to being useful for QTL mapping (Meuwissen and Goddard 2000), association analysis can sometimes identify the mutations that cause specific phenotypes (Palaisa et al. 2004). Target gene regions are expected to be small relative to those in specific mapping populations, since association studies benefit from all of the generations of recombination that followed the origination of a specific allele mutation (Cardon and Bell 2001; Gupta et al. 2005). If LD exists between a marker and a locus associated with a trait, then specific marker alleles or haplotypes (i.e., genotype combinations at groups of linked markers) can be associated with phenotypic values at a high level of statistical significance (Cardon and Bell 2001). In conducting association analysis, however, one must be wary of spurious associations between candidate markers and phenotypes that can result from the presence of population structure (Pritchard et al. 2000b). False positive associations may occur if the frequency of a certain phenotype varies across subpopulations, thereby increasing the probability that sampling from different subpopulations will not be random. As a result, a marker allele that occurs at a high frequency in a preferentially sampled subpopulation may appear to be associated with the trait of interest even though it is not linked to a real QTL.
Association studies have been effectively used to identify the genetic causes of several human diseases (Pritchard and Przeworski 2001; Reich et al. 2001; Goedde et al. 2002; Weiss and Clark 2002; Twells et al. 2003). LD has also been useful in the fine mapping of complex disease genes (Terwilliger and Weiss 1998; Kruglyak 1999; Jorde 2000), and is widely used in genome-wide association studies (Risch and Merikangas 1996; Reich et al. 2001). Association studies have been used to map plant QTL using both candidate gene and genome scan approaches (Flint-Garcia et al. 2003; Gupta et al. 2005). Hansen et al. (2001) used LD and 440 AFLP markers to map the bolting (B) gene in sea beet (Beta vulgaris ssp. maritima). Two markers with significant LD were identified as being linked to the B gene. Association mapping has also been tested in a gene bank collection of 600 potato (Solanum tuberosum) cultivars (Gebhardt et al. 2004). A highly significant association with resistance to late blight and plant maturity was detected with PCR markers specific for R1, a major late blight resistance gene. Ivandic et al. (2002) used 33 SSR markers to study association with flowering time and several other adaptive traits in barley (Hordeum vulgare L.). SSRs significantly associated with flowering time under different growing regimes were identified, and most associations could be accounted for by markers linked to genes for early maturity.
Although several LD maps have been constructed for the human genome, construction of LD maps of plant genomes is just beginning (Gupta et al. 2005). Remington et al. (2001) measured LD across the maize (Zea mays L.) genome through the analysis of 47 SSR markers, and reported rapid decay of LD over 12 kb at the su1 locus. This study also suggested that SSR markers were more efficient than single nucleotide polymorphisms (SNPs) for tracking recent population structure, since greater levels of LD were detected between SSR markers than between SNPs, which are considered to be evolutionarily older (Flint-Garcia et al. 2003). Genome-wide LD was measured with 76 accessions of Arabidopsis thaliana that were genotyped at 163 SNPs (Nordborg et al. 2002). LD typically started to decay within 50 kb, although LD did persist for 250 kb in one 500 kb region. The pattern of intrachromosomal LD in barley showed that long-range LD extended up to distances as long as 50 cM with r2 > 0.05, or up to 10 cM with r2 > 0.2 (Malysheva-Otto et al. 2006). In soybean, Hyten et al. (2007) reported that LD extended from 90 kb to 574 kb in the three cultivated G. max groups across the three genome regions referred to as CR-A2, CR-G and CR-J, but less than 100 kb in the G. soja group.
To our knowledge, no association studies to detect QTL associated with seed protein content in soybean have ever been reported, though several linkage mapping studies have been conducted to identify protein content QTL (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Orf et al. 1999; Casanadi et al. 2001). The objective of the present study was to evaluate and use LD in an association mapping approach to identify soybean seed protein QTL.
Materials and methods
Plant populations and DNA extractions
A total of 96 soybean accessions from Korea, China, and Japan were obtained from the USDA soybean germplasm collection and were selected on the basis of seed protein content levels listed at the Germplasm Resources Information Network (GRIN) website (http://www.ars-grin.gov/npgs/). This association mapping population (AMP) consisted of an “HP” group of 48 accessions with high seed protein content (50.0–57.4%) and an “LP” group of 48 accessions with low protein content (31.7–38.7%). Besides being selected for their low or high protein levels, accessions included in the HP and LP groups were chosen to represent different geographical origins (China, Korea, and Japan) and maturity groups (MGs) in an attempt to minimize population structure. Some wild soybean (Glycine soja) accessions were included in the groups.
Table: Description of association mapping population and confirmation populations used in this study. Columns: No. of accessions; Seed protein content (%) (average); Maturity group (number of accessions); Origin (number of accessions).
DNA was extracted from fresh leaf tissue of young seedlings using the protocol described by Shure et al. (1983) with a slight modification. DNA concentration was measured using an F-4500 spectrophotometer (Hitachi Ltd., Ibaragi, Japan) and a Fluorescent DNA Quantification Kit (Bio-Rad, Hercules, CA, USA). All DNA samples were diluted to 20 ng μl−1 with Tris–EDTA buffer (pH 8.0) prior to amplification in polymerase chain reactions (PCR).
For the whole-genome scan mapping approach used with the AMP, 200 SSR markers were chosen on the basis of their locations on the 20 linkage groups (LGs) of the integrated genetic linkage map of soybean (Song et al. 2004). Some markers had been mapped to within 5 cM of previously reported QTL associated with soybean seed protein content in linkage mapping studies (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Sebolt et al. 2000). Primer sequences for the SSR markers were obtained from SoyBase (http://soybase.org), and fluorescently labeled forward primers and unlabeled reverse primers were purchased from Applied Biosystems (Foster City, CA, USA). PCR amplifications were performed in 10-μl reactions containing 2 μl of template DNA, 1.0× PCR buffer, 2.5 mM MgCl2, 100 μM of each dNTP, 0.2 μM each of the forward and reverse primers, and 0.5 units of Taq DNA polymerase (Promega, Madison, WI, USA). The reactions were performed on a PTC-225 Peltier Thermal Cycler (MJ Research Inc., Watertown, MA, USA). Amplicons were detected using an ABI-Prism 377 DNA Sequencer (Applied Biosystems, Foster City, CA, USA) and 4.8% 19:1 acrylamide:bisacrylamide gels during a 2-h electrophoresis at 750 volts. Marker data were analyzed with GeneScan v.3.0 and Genotyper v.2.1 software from Applied Biosystems.
The molecular variance among maturity groups and among the three subgroups originating from Korea, China, and Japan within the AMP was tested by using the analysis of molecular variance (AMOVA) method (Excoffier et al. 1992) in GenAlEx version 6 (Peakall and Smouse 2006).
The AMP was analyzed for possible population structure with the STRUCTURE program (Pritchard et al. 2000a) using the admixture model and the non origin-base model. To infer the number of subpopulations (K), five independent runs were performed at each value of K from K = 2 to K = 6. Both the length of the burn-in period and the number of iterations were set at 200,000.
LD values (r2) between SSR loci on the same LG were calculated using the software package TASSEL (http://www.maizegenetics.net) without the rapid permutations test. The pairs of loci were considered to be in significant LD if P was <0.01. The estimated genetic distance (cM) between loci was inferred from the public USDA map (Song et al. 2004).
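The r2 statistic can be illustrated with a minimal Python sketch. It covers only the two-allele case, where one focal allele at each locus is contrasted with all other alleles, and it assumes homozygous inbred accessions so that each accession contributes one haplotype. It does not reproduce how TASSEL handles multi-allelic SSR loci or its significance test; the function name and example data are hypothetical.

```python
def r_squared(locus1, locus2, allele1, allele2):
    """Squared allelic correlation between two loci (biallelic simplification).

    locus1, locus2: one allele call per accession, in the same accession order.
    allele1, allele2: the focal allele at each locus; all others are pooled.
    """
    n = len(locus1)
    p_a = sum(a == allele1 for a in locus1) / n
    p_b = sum(b == allele2 for b in locus2) / n
    p_ab = sum(a == allele1 and b == allele2
               for a, b in zip(locus1, locus2)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic locus carries no LD information
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom


# Hypothetical calls for four accessions (not data from this study).
complete_ld = r_squared(["A", "A", "a", "a"], ["B", "B", "b", "b"], "A", "B")
no_ld = r_squared(["A", "A", "a", "a"], ["B", "b", "B", "b"], "A", "B")
```

In this sketch, perfectly co-occurring alleles give r2 = 1 and independent alleles give r2 = 0.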
In the whole-genome scan approach that we used with the AMP, associations between markers and phenotypes were tested by calculating differences in allele frequencies between the LP and HP groups at each of the marker loci. Differences in allele frequencies were compared statistically using contingency tables with counts of alleles for the LP and HP groups. For all alleles at a SSR locus, probability (P) values were calculated for the differences in allele frequency distributions between the two groups at each marker locus.
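The allele-frequency comparison can be sketched as a test on a 2 × k contingency table (two protein groups, k alleles at one SSR locus). The text above does not name the test statistic, so the Pearson chi-square test used here is an assumption, as are the function names, the one-allele-per-accession counting, and the example allele sizes.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable (integer dof)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q, k = math.exp(-half), 2                # closed form for dof = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1     # closed form for dof = 1
    while k < dof:                               # step dof up by 2 each pass
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def allele_frequency_test(hp_alleles, lp_alleles):
    """Pearson chi-square test on allele counts in the HP and LP groups."""
    hp, lp = Counter(hp_alleles), Counter(lp_alleles)
    alleles = sorted(set(hp) | set(lp))
    table = [[hp[a] for a in alleles], [lp[a] for a in alleles]]
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(alleles)):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    dof = len(alleles) - 1
    return chi2, dof, chi2_sf(chi2, dof)


# Hypothetical allele sizes (bp) at one locus, 48 accessions per group.
hp_group = [150] * 40 + [156] * 8
lp_group = [150] * 10 + [156] * 38
chi2, dof, p_value = allele_frequency_test(hp_group, lp_group)
```

With many low-count alleles, an exact or permutation test may be more appropriate than the chi-square approximation.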
Analysis of genetic diversity and population structure
Table: Analysis of molecular variance for maturity group and geographic origin. Columns include: Est. Var.
Level of linkage disequilibrium among intrachromosomal SSR loci
Table: Evaluation of LD values for the genetic distance between loci pairs in all 96 accessions. Columns: Genetic distance between loci pairs (cM); No. of loci pairs in LD; Sum of loci pairs (no.); Freq. of loci pairs in LD (%).
Association mapping for seed protein QTL
Table: SSR markers showing a significant difference in allele frequency between the high and low protein populations (P < 0.0001). Columns: Map position (cM); QTL reported by linkage analysis; Map position (cM). Sources cited for the reported QTL, in row order: Orf et al. (1999); Brummer et al. (1997); Brummer et al. (1997); Sebolt et al. (2000); Lee et al. (1996); Lee et al. (1996); Diers et al. (1992); Lee et al. (1996); Specht et al. (2001).
Of the 11 putative QTL, 9 were located in regions where protein QTL have been previously mapped using linkage analysis. For example, Satt564 on LG G is about 10.4 cM away from RFLP marker A890_1 (R2 = 15.6%; Brummer et al. 1997), and Satt159 on LG N maps approximately 3.2 cM away from RFLP marker A071_2 (R2 = 11.2%; Lee et al. 1996). These results suggest that the association analysis approach that we used in this study was effective for the detection of QTL associated with seed protein content.
Several markers with significant differences in allele frequency distribution between the LP and HP groups were located in two regions where QTL associated with protein have not been reported. Satt431 on LG J and Satt551 on LG M were not in the vicinity of known seed protein QTL (Table 4).
To investigate the possibility that some maturity QTL were misidentified as putative protein content QTL, known maturity QTL were surveyed using the Soybean Breeders Toolbox (http://soybase.org). Of 22 seed protein QTL previously detected by linkage analysis, 9 were within 30 cM of a maturity QTL. In addition, 3 of the 11 putative QTL identified by our association analysis were located near QTL for maturity. Interestingly, Satt431 (LG J) and Satt551 (LG M), from the newly identified genomic regions with putative protein content QTL, do not map close to any known maturity QTL.
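The proximity check described above can be sketched as a lookup: a marker is flagged when a maturity QTL lies on the same linkage group within a 30 cM window. The function name, data layout, and all names and positions below are hypothetical placeholders, not values from SoyBase or from this study.

```python
def near_maturity_qtl(markers, maturity_qtl, window_cm=30.0):
    """Flag markers lying within window_cm of a maturity QTL on the same LG.

    markers, maturity_qtl: dicts mapping a name to (linkage_group, position_cM).
    """
    flagged = {}
    for name, (lg, pos) in markers.items():
        flagged[name] = any(
            q_lg == lg and abs(q_pos - pos) <= window_cm
            for q_lg, q_pos in maturity_qtl.values()
        )
    return flagged


# Placeholder names and positions for illustration only.
markers = {
    "marker_1": ("LG_X", 10.0),
    "marker_2": ("LG_X", 80.0),
    "marker_3": ("LG_Y", 12.0),
}
maturity_qtl = {"maturity_1": ("LG_X", 35.0)}
flags = near_maturity_qtl(markers, maturity_qtl)
```

Here only marker_1 is flagged, since it is 25 cM from the placeholder maturity QTL on the same linkage group.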
Confirmation of markers for seed protein content
Confirmation for the selected markers detected by association analysis
For accurate association mapping based on LD, diverse populations are required. In this study, the structure of the genetic diversity based on origin (Korea, China, and Japan) and maturity group was tested by AMOVA. Only a relatively small portion (9%) of the molecular variation was explained by the geographical origin of the accessions. However, about 19% of the molecular variation was accounted for by maturity group of representative accessions (Table 2). The AMOVA indicated that the accessions are highly structured in this study. In fact, although the AMOVA evidenced significant differences among accessions grouped on the basis of their maturity group and origin, a high degree of variability, 81% for maturity group and 91% for origin, was also detected within each group.
Model-based clustering analysis of the 96 accessions in the AMP revealed complex genetic relationships among the entire set of accessions. The Chinese subpopulation was divided into two different groups according to maturity. Also, four Japanese accessions were separated from the main subpopulation because they were wild species with a small seed size. The 96 accessions used for association analysis were split into six distinct subpopulations through comparison of their origin and other agronomic traits at K = 6. Thus, three main subpopulations corresponding to the three distinct origins were roughly detected in our population, and three further subdivisions were added, suggesting the existence of population structure (Fig. 1).
The fundamental idea of a population-based method is to separate accessions obtained from a mixed population into several unstructured subpopulations and to determine the association between marker alleles and phenotypes in the homogeneous subpopulations (Pritchard et al. 2000b; Gupta et al. 2005). In addition, spurious associations are not considered likely when the accessions related to the particular phenotypes are not biased towards specific subpopulations, even when population structure is present (Pritchard et al. 2000a; Cardon and Palmer 2003; Malysheva-Otto et al. 2006; Ostrowski et al. 2006). The six subpopulations identified by our analysis of population structure reflected distinct subdivisions on the basis of origin and maturity group. Also, accessions associated with high protein content remained in most subpopulations without biased distribution towards particular subpopulations. Therefore, the population used in this study was thought to be applicable to association analysis, even if some population structure is present.
A clear relationship was observed between the degree of linkage between loci pairs and the level of LD. Moreover, the effect of recombination on the level of LD could be inferred indirectly (Table 2). In our study, the intrachromosomal LD extended up to 50 cM with r2 > 0.1, or 10 cM with r2 > 0.2 at P < 0.01 for all 96 accessions (Fig. 2). Extensive LD has also been reported in other selfing species. Malysheva-Otto et al. (2006) reported that intrachromosomal LD extended up to 50 cM with r2 > 0.05, or up to 10 cM with r2 > 0.2 in 953 barley accessions, and LD persists over 4 cM in sorghum (Deu and Glaszmann 2004). Interestingly, intrachromosomal LD extended 50–100 cM with r2 > 0.05 in all 96 accessions. Although this level of LD persistence is considered to be high, long-distance LD of up to 50–100 cM with r2 > 0.2 has also been reported in several isolated local populations of Arabidopsis accessions (Nordborg et al. 2002). Additionally, long-distance LD of up to 100 cM with r2 > 0.1 was detected in the population of European two-row spring barley (Kraakman et al. 2004). However, it was proposed recently that the cutoff level for useful levels of LD in plants should be limited to r2 > 0.1 (Malysheva-Otto et al. 2006).
The number of markers required to cover the genome in an association study is determined by the extent of LD (Flint-Garcia et al. 2003; Malysheva-Otto et al. 2006). Therefore, 150–300 markers should be adequate to conduct preliminary whole genome association studies in soybean (about 3,000 cM). This is far fewer than would be required for other species or populations with less LD.
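One way to read the 150–300 figure is as the map length divided by the marker spacing that LD can bridge. The sketch below assumes a spacing of 10 to 20 cM; the 10 cM value echoes the distance at which r2 > 0.2 was observed, while the 20 cM upper value is an assumption chosen here only because it reproduces the lower bound of the range. The function name is hypothetical.

```python
import math


def markers_needed(genome_cm, spacing_cm):
    """Approximate marker count for even coverage at a given spacing (cM)."""
    return math.ceil(genome_cm / spacing_cm)


genome_cm = 3000  # approximate soybean map length quoted in the text
dense = markers_needed(genome_cm, 10)   # one marker per 10 cM
sparse = markers_needed(genome_cm, 20)  # one marker per 20 cM (assumed)
```

Under these assumptions the calculation gives 300 and 150 markers, matching the range quoted above.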
In our study, two groups of soybean accessions with either high or low seed protein content were used for the association analysis. Analyses to test for population structure were prompted by concerns about unbalanced representation of maturity groups in the high and low protein subpopulations. A survey of the more than 15,000 accessions in the USDA germplasm collection was conducted to select two groups of 48 accessions from a larger group of 300 accessions with either high or low protein content. In order to reduce or eliminate the potential effects of population structure in biasing the statistical analyses, an effort was made to include accessions from various geographical locations and MGs in each of the two protein groups. Selection of 48 individuals to represent diverse geographical origins (Korea, China and Japan) in each group was easily accomplished, but it was difficult to balance representative MGs between the high and low protein groups with the pool limited to 300 accessions. However, even when it was expanded to 500 accessions, representation of the various MGs in each protein group remained unbalanced, thus contributing to population structure. We attempted to address this limitation by retesting significant markers after genotyping accessions in high and low protein groups from two confirmation populations that were independent of each other and the original AMP.
Detection of seed protein content QTL was done by testing for significant differences in allele frequencies between the low and high protein groups (Table 4). As G. soja shares common alleles with G. max at the seed protein QTL, allele data were included in the association analysis. A similar association analysis study using SSR markers was conducted to identify genes associated with multiple sclerosis (MS) in humans (Goedde et al. 2002). Four markers in the HLA major histocompatibility complex region associated with MS showed a significant difference in allele frequencies between MS cases and controls.
Case–control studies have been widely used to examine genetic risk factors for complex diseases in human genetics. The most important issue in case–control studies is selection of two well-defined groups representing patients and unaffected controls (Lewis 2002; Ma et al. 2006). Groups of soybean accessions with either high or low seed protein content were used for our association study instead of using high and normal protein content groups. In other words, this study was initially designed as a case–case study to detect both positive and negative genes controlling protein content together in our extremely selected populations. The benefit of this methodology is that predominant alleles associated with high or low seed protein content can be simultaneously compared and evaluated for statistical differences in allele frequencies. Case–case studies or case–case–control studies have sometimes been performed for human diseases (Ma et al. 2006; Potoski et al. 2006; Robert et al. 2006). These studies investigated the risk factors associated with disease in two well-defined patient groups.
When association studies are performed using multiple-allele markers like SSRs, one concern is how to treat rare alleles (Lewis 2002). Inclusion or exclusion of data for rare alleles in our present association analysis had little effect on the level of significance of the markers.
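One simple way to check sensitivity to rare alleles is to drop alleles below a pooled frequency threshold and rerun the association test on the filtered calls. The sketch below shows only the filtering step; the 5% threshold, the function name, and the example allele sizes are assumptions, and the text above does not state which rule was actually applied.

```python
from collections import Counter


def drop_rare_alleles(hp_alleles, lp_alleles, min_freq=0.05):
    """Remove alleles whose pooled frequency across both groups is below min_freq."""
    pooled = Counter(hp_alleles) + Counter(lp_alleles)
    total = sum(pooled.values())
    keep = {allele for allele, count in pooled.items()
            if count / total >= min_freq}
    hp_kept = [a for a in hp_alleles if a in keep]
    lp_kept = [a for a in lp_alleles if a in keep]
    return hp_kept, lp_kept


# Hypothetical allele sizes (bp); allele 162 occurs in 2 of 96 accessions.
hp_group = [150] * 30 + [156] * 17 + [162]
lp_group = [150] * 20 + [156] * 27 + [162]
hp_common, lp_common = drop_rare_alleles(hp_group, lp_group)
```

Comparing the test result before and after filtering indicates how much a marker's significance depends on rare alleles.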
In conclusion, analysis of population structure using a model-based clustering method showed the existence of genetic diversity in our plant materials, although population structure was present due to maturity group and origin. Also, the long-range LD estimated in this study demonstrates the potential for genome-wide association mapping with fewer markers in soybean. After maturity QTL listed at the Soybean Breeders Toolbox were positioned on our soybean SSR genetic linkage map (Fig. 4), 9 of 22 seed protein QTL were near or very close to QTL for maturity. Many maturity QTL identified by linkage analysis thus seem to overlap with QTL for protein content, indicating a biological correlation between maturity and seed protein content. This could therefore affect the ability to identify QTL for seed protein content in the association analysis. However, of the 11 SSR markers showing significance between high and low protein groups in this association analysis, only three were mapped close to a known maturity QTL. The other eight markers, including the two from genomic regions in which protein QTL had not been previously identified, could be linked to seed protein content QTL instead of maturity QTL that influence seed protein content. Thus, Satt431 on LG J and Satt551 on LG M in this association analysis could be linked to novel QTL for seed protein, although a possible bias resulting from a degree of population structure cannot be ignored. The association analysis approach that we used successfully identified a number of SSR markers linked to previously reported QTL associated with soybean seed protein content, and two newly identified markers for seed protein QTL. These two markers were also significant in the independent confirmation populations. Further studies, perhaps using a linkage mapping approach, are needed to confirm whether Satt431 on LG J and Satt551 on LG M are truly linked to previously undetected QTL for seed protein content.
Although the limited number of markers used in this study and the existence of population structure cannot be ignored, these association studies could provide valuable information for identifying the possible locations of additional QTL in soybean.
This research was supported in part by a grant (code no. CG3121) from the Crop Functional Genomics Center of the 21st Century Frontier Research Program, funded by the Ministry of Science and Technology (MOST) of the Republic of Korea. We also thank the National Instrumentation Center for Environmental Management at Seoul National University in Korea. We express our thanks to Dr. H. Roger Boerma (University of Georgia, USA) for his critical comments on this manuscript.