Euphytica, Volume 162, Issue 2, pp 179–191

Association analysis using SSR markers to find QTL for seed protein content in soybean

Authors

  • Tae-Hwan Jun
    • Department of Plant Science, Seoul National University
  • Kyujung Van
    • Department of Plant Science, Seoul National University
  • Moon Young Kim
    • Department of Plant Science, Seoul National University
    • Research Institute for Agriculture and Life Sciences, Seoul National University
    • Department of Plant Science, Seoul National University
  • David R. Walker
    • Soybean/Maize Germplasm, Pathology, and Genetics Research Unit, USDA-ARS, 232 National Soybean Research Center

DOI: 10.1007/s10681-007-9491-6

Cite this article as:
Jun, T., Van, K., Kim, M.Y. et al. Euphytica (2008) 162: 179. doi:10.1007/s10681-007-9491-6

Abstract

Association analysis studies can be used to test for associations between molecular markers and quantitative trait loci (QTL). In this study, a genome-wide scan was performed using 150 simple sequence repeat (SSR) markers to identify QTL associated with seed protein content in soybean. The initial mapping population consisted of two subpopulations of 48 germplasm accessions each, with high or low protein levels based on data from the USDA’s Germplasm Resources Information Network website. Intrachromosomal LD extended up to 50 cM with r² > 0.1 and 10 cM with r² > 0.2 across the accessions. An association map consisting of 150 markers was constructed on the basis of differences in allele frequency distributions between the two subpopulations. Eleven putative QTL were identified on the basis of highly significant markers. Nine of these are in regions where protein QTL have been mapped, but the genomic regions containing Satt431 on LG J and Satt551 on LG M have not been reported in previous linkage mapping studies. Furthermore, these new putative protein QTL do not map near any QTL known to affect maturity. Since biased population structure was known to exist in the original association analysis population, association analyses were also conducted on two similar but independent confirmation populations. Satt431 and Satt551 were also significant in those analyses. These results suggest that our association analysis approach could be a useful alternative to linkage mapping for the identification of unreported regions of the soybean genome containing putative QTL.

Keywords

Association mapping; Glycine max; Linkage disequilibrium (LD); Population structure; Quantitative trait loci (QTL); Seed protein content; Simple sequence repeat (SSR)

Introduction

Although traditional breeding methods have been successfully used to increase seed protein content in soybean [Glycine max (L.) Merr.] (Hartwig 1990; Leffel 1992; Chung et al. 2003), the development of new cultivars with high protein levels could be facilitated by the use of marker-assisted selection (MAS) for high protein genes. Molecular markers have been used in linkage mapping studies of segregating populations to identify the quantitative trait loci (QTL) associated with seed protein levels (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Orf et al. 1999; Casanadi et al. 2001). A linkage mapping approach to finding QTL requires time to develop mapping populations and it is necessary to evaluate these populations in multiple environments in order to obtain robust phenotypic data. Additional limitations result from the small size of most mapping populations and from the limited opportunities for crossing over to occur during the development of these populations (Hansen et al. 2001; Stella and Boettcher 2004; Gupta et al. 2005). The number of meioses that occur during the development of most mapping populations is small, and the limited recombination makes it difficult to map QTL with much precision (Cardon and Bell 2001).

Association analysis based on linkage disequilibrium (LD) has recently emerged as an alternative approach to mapping QTL and genes associated with some human diseases (Pritchard and Przeworski 2001; Reich et al. 2001; Weiss and Clark 2002). LD is defined as the nonrandom association of alleles at different loci (Flint-Garcia et al. 2003). In addition to being useful for QTL mapping (Meuwissen and Goddard 2000), association analysis can sometimes identify the mutations that cause specific phenotypes (Palaisa et al. 2004). Target gene regions are expected to be small relative to those in specific mapping populations, since association studies benefit from all of the generations of recombination that followed the origination of a specific allele mutation (Cardon and Bell 2001; Gupta et al. 2005). If LD exists between a marker and a locus associated with a trait, then specific marker alleles or haplotypes (i.e., genotype combinations at groups of linked markers) can be associated with phenotypic values at a high level of statistical significance (Cardon and Bell 2001). In conducting association analysis, however, one must be wary of spurious associations between candidate markers and phenotypes that can result from the presence of population structure (Pritchard et al. 2000b). False positive associations may occur if the frequency of a certain phenotype varies across subpopulations, thereby increasing the probability that sampling from different subpopulations will not be random. As a result, a marker allele that occurs at a high frequency in a preferentially sampled subpopulation may appear to be associated with the trait of interest even though it is not linked to a real QTL.

Association studies have been effectively used to identify the genetic causes of several human diseases (Pritchard and Przeworski 2001; Reich et al. 2001; Goedde et al. 2002; Weiss and Clark 2002; Twells et al. 2003). LD has also been useful in the fine mapping of complex disease genes (Terwilliger and Weiss 1998; Kruglyak 1999; Jorde 2000), and is widely used in genome-wide association studies (Risch and Merikangas 1996; Reich et al. 2001). Association studies have been used to map plant QTL using both candidate gene and genome scan approaches (Flint-Garcia et al. 2003; Gupta et al. 2005). Hansen et al. (2001) used LD and 440 AFLP markers to map the bolting (B) gene in sea beet (Beta vulgaris ssp. maritima). Two markers with significant LD were identified as being linked to the B gene. Association mapping has also been tested in a gene bank collection of 600 potato (Solanum tuberosum) cultivars (Gebhardt et al. 2004). A highly significant association with resistance to late blight and plant maturity was detected with PCR markers specific for R1, a major late blight resistance gene. Ivandic et al. (2002) used 33 SSR markers to study association with flowering time and several other adaptive traits in barley (Hordeum vulgare L.). SSRs significantly associated with flowering time under different growing regimes were identified, and most associations could be accounted for by markers linked to genes for early maturity.

Although several LD maps have been constructed for the human genome, construction of LD maps of plant genomes is just beginning (Gupta et al. 2005). Remington et al. (2001) measured LD across the maize (Zea mays L.) genome through the analysis of 47 SSR markers, and reported rapid decay of LD over 12 kb at the su1 locus. This study also suggested that SSR markers were more efficient than single nucleotide polymorphisms (SNPs) for tracking recent population structure, since greater levels of LD were detected between SSR markers than between SNPs, which are considered to be evolutionarily older (Flint-Garcia et al. 2003). Genome-wide LD was measured with 76 accessions of Arabidopsis thaliana that were genotyped at 163 SNPs (Nordborg et al. 2002). LD typically started to decay within 50 kb, although LD did persist for 250 kb in one 500 kb region. The pattern of intrachromosomal LD in barley showed that long-range LD extended up to distances as long as 50 cM with r² > 0.05, or up to 10 cM with r² > 0.2 (Malysheva-Otto et al. 2006). In soybean, Hyten et al. (2007) reported that LD extended from 90 kb to 574 kb in the three cultivated G. max groups across the three genome regions referred to as CR-A2, CR-G and CR-J, but less than 100 kb in the G. soja group.

To our knowledge, no association studies to detect QTL associated with seed protein content in soybean have ever been reported, though several linkage mapping studies have been conducted to identify protein content QTL (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Orf et al. 1999; Casanadi et al. 2001). The objective of the present study was to evaluate and use LD in an association mapping approach to identify soybean seed protein QTL.

Materials and methods

Plant populations and DNA extractions

A total of 96 soybean accessions from Korea, China, and Japan were obtained from the USDA soybean germplasm collection and were selected on the basis of seed protein content levels listed at the Germplasm Resources Information Network (GRIN) website (http://www.ars-grin.gov/npgs/). This association mapping population (AMP) consisted of an “HP” group of 48 accessions with high seed protein content (50.0–57.4%) and an “LP” group of 48 accessions with low protein content (31.7–38.7%). Besides being selected for their low or high protein levels, accessions included in the HP and LP groups were chosen to represent origin from different geographical regions (China, Korea, and Japan) and maturity groups (MGs) in an attempt to minimize population structure. Some wild soybean (Glycine soja) accessions were included in the groups.

To confirm the QTL identified in the initial association study, two confirmation populations (CP1 and CP2) composed of independent accessions grouped on the basis of having either high or low protein contents and diverse origins were again selected from the USDA soybean germplasm collection. Each of the confirmation populations also consisted of 96 accessions divided into an HP group of 48 high-protein accessions and an LP group of 48 low-protein accessions (Table 1).
Table 1

Description of association mapping population and confirmation populations used in this study

| | AMP^a: HP | AMP^a: LP | CP1^b: HP | CP1^b: LP | CP2^b: HP | CP2^b: LP |
|---|---|---|---|---|---|---|
| No. of accessions | 48 | 48 | 48 | 48 | 48 | 48 |
| Seed protein content (%)^c (Average) | 50.8–57.4 (52.3) | 31.7–38.7 (36.1) | 50.6–57.9 (52.9) | 34.5–38.0 (37.0) | 50.4–56.0 (52.0) | 35.3–38.0 (37.1) |
| Maturity group^c (Number of accessions) | IV–VI (38); VII–VIII (10) | I–III (28); IV–VI (20) | O–OOO (10); I–III (8); IV–VI (18); VII–VIII (12) | O–OOO (9); I–III (31); IV–VI (6); VII–VIII (2) | O–OOO (9); I–III (8); IV–VI (19); VII–VIII (12) | O–OOO (1); I–III (42); IV–VI (4); VII–VIII (1) |
| Origin^c (Number of accessions) | Korea (14); China (24); Japan (10) | Korea (12); China (24); Japan (12) | Asia (42); Europe (3); America (3) | Asia (20); Europe (20); America (7); Australia (1) | Asia (42); Europe (2); America (2); Africa (2) | Asia (18); Europe (18); America (6); Africa (2); Unknown (4) |

a Association mapping population

b Confirmation population

c Data were according to USDA National Plant Germplasm System

DNA was extracted from fresh leaf tissue of young seedlings using the protocol described by Shure et al. (1983) with a slight modification. DNA concentration was measured using an F-4500 spectrophotometer (Hitachi Ltd., Ibaragi, Japan) and a Fluorescent DNA Quantification Kit (Bio-Rad, Hercules, CA, USA). All DNA samples were diluted to 20 ng μl−1 with Tris–EDTA buffer (pH 8.0) prior to amplification in polymerase chain reactions (PCR).

Genotyping

For the whole-genome scan mapping approach used with the AMP, 200 SSR markers were chosen on the basis of their locations on the 20 linkage groups (LGs) of the integrated genetic linkage map of soybean (Song et al. 2004). Some markers had been mapped to within 5 cM of previously reported QTL associated with soybean seed protein content in linkage mapping studies (Diers et al. 1992; Lee et al. 1996; Brummer et al. 1997; Sebolt et al. 2000). Primer sequences for the SSR markers were obtained from SoyBase (http://soybase.org), and fluorescently labeled forward primers and unlabeled reverse primers were purchased from Applied Biosystems (Foster City, CA, USA). PCR amplifications were performed in 10-μl reactions containing 2 μl of template DNA, 1.0× PCR buffer, 2.5 mM MgCl₂, 100 μM of each dNTP, 0.2 μM each of the forward and reverse primers, and 0.5 units of Taq DNA polymerase (Promega, Madison, WI, USA). The reactions were performed on a PTC-225 Peltier Thermal Cycler (MJ Research Inc., Watertown, MA, USA). Amplicons were detected using an ABI-Prism 377 DNA Sequencer (Applied Biosystems, Foster City, CA, USA) and 4.8% 19:1 acrylamide:bisacrylamide gels during a 2-h electrophoresis at 750 volts. Marker data were analyzed with GeneScan v.3.0 and Genotyper v.2.1 software from Applied Biosystems.

Statistical analysis

The molecular variance among maturity groups and among the three subgroups originating from Korea, China, and Japan within the AMP was tested using the analysis of molecular variance (AMOVA) method (Excoffier et al. 1992) in GenAlEx version 6 (Peakall and Smouse 2006).

The AMP was analyzed for possible population structure with the STRUCTURE program (Pritchard et al. 2000a) using the admixture model and the non-origin-based model. To estimate the number of subpopulations (K), five independent runs were performed at K levels ranging from K = 2 to K = 6. Both the length of the burn-in period and the number of iterations were set at 200,000.

LD values (r²) between SSR loci on the same LG were calculated using the software package TASSEL (http://www.maizegenetics.net) without the rapid permutations test. Pairs of loci were considered to be in significant LD if P was <0.01. The estimated genetic distance (cM) between loci was inferred from the public USDA map (Song et al. 2004).
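As a rough illustration of the r² statistic, the sketch below computes it for the simplest case of two biallelic loci scored on inbred lines, each treated as contributing one haplotype. This is a simplifying assumption: SSR loci are multi-allelic, and the way TASSEL combines multiple alleles and tests significance is not reproduced here. The function name and data layout are hypothetical.

```python
def r_squared(haplotypes):
    """Squared allele-frequency correlation between two biallelic loci.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2) pairs,
    one per accession (assumed homozygous, so one haplotype each).
    """
    n = len(haplotypes)
    # Use the alleles of the first accession as arbitrary reference alleles.
    a_ref, b_ref = haplotypes[0]
    p_a = sum(1 for a, _ in haplotypes if a == a_ref) / n
    p_b = sum(1 for _, b in haplotypes if b == b_ref) / n
    p_ab = sum(1 for a, b in haplotypes if a == a_ref and b == b_ref) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom


# Toy data: eight accessions, alleles coded 0/1 at each locus.
toy = [(0, 0)] * 3 + [(0, 1)] + [(1, 0)] + [(1, 1)] * 3
print(r_squared(toy))
```

Complete association between the two loci gives r² = 1, and independent loci give r² = 0.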

In the whole-genome scan approach that we used with the AMP, associations between markers and phenotypes were tested by calculating differences in allele frequencies between the LP and HP groups at each of the marker loci. Differences in allele frequencies were compared statistically using contingency tables with counts of alleles for the LP and HP groups. For all alleles at an SSR locus, probability (P) values were calculated for the differences in allele frequency distributions between the two groups at each marker locus.
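The contingency-table comparison can be sketched as follows. The text does not name the test applied to the tables, so a Pearson chi-square test on a 2 × (number of alleles) table is assumed here; the function names and the amplicon sizes in the toy data are hypothetical. Only the Python standard library is used, so the chi-square tail probability is computed from the recurrence for the regularized incomplete gamma function.

```python
import math
from collections import Counter


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of freedom."""
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    # Closed forms for df = 2 (even) or df = 1 (odd), then the recurrence
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def allele_frequency_test(hp_alleles, lp_alleles):
    """Pearson chi-square test of allele counts in the HP versus LP group."""
    hp, lp = Counter(hp_alleles), Counter(lp_alleles)
    alleles = sorted(set(hp) | set(lp))
    n_hp, n_lp = sum(hp.values()), sum(lp.values())
    total = n_hp + n_lp
    stat = 0.0
    for allele in alleles:
        column_total = hp[allele] + lp[allele]
        for observed, row_total in ((hp[allele], n_hp), (lp[allele], n_lp)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    df = len(alleles) - 1  # (rows - 1) * (columns - 1) with two rows
    return stat, df, chi2_sf(stat, df)


# Toy locus with two amplicon sizes (bp) scored in 48 HP and 48 LP accessions.
hp_group = [150] * 40 + [156] * 8
lp_group = [150] * 10 + [156] * 38
stat, df, p = allele_frequency_test(hp_group, lp_group)
print(f"chi2 = {stat:.2f}, df = {df}, P = {p:.2e}")
```

With many rare alleles the expected counts become small and the chi-square approximation weakens; an exact or permutation test would be a reasonable substitute in that case.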

Results

Analysis of genetic diversity and population structure

The molecular variance within origin-based subgroups accounted for about 91% of the total variation, while variance among subgroups accounted for the remaining 9% at P < 0.0001. For maturity group, about 19% of total variation was due to the molecular variance among maturity subgroups (Table 2).
Table 2

Analysis of molecular variance for maturity group and geographic origin

| Source | Df | SS^a | MS^b | Est. Var.^c | % | Value | Prob. |
|---|---|---|---|---|---|---|---|
| Maturity group^d | | | | | | | |
| Among pops. | 6 | 1445.6 | 240.9 | 8.1 | 19 | 0.191 | 0.001 |
| Within pops. | 185 | 6301.7 | 34.1 | 34.1 | 81 | | |
| Geographic origin^d | | | | | | | |
| Among pops. | 2 | 512.8 | 256.4 | 3.6 | 9 | 0.086 | 0.0001 |
| Within pops. | 189 | 7227.7 | 38.2 | 38.2 | 91 | | |

a Sum of squares

b Mean squares

c Estimates of variance

d AMP was used for AMOVA
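The percentages in Table 2 follow from the estimated variance components. The short sketch below recomputes them, assuming that the "Value" column is the among-population variance divided by the total variance; that reading is ours and is not stated in the table. The function name is hypothetical.

```python
def variance_partition(among, within):
    """Return the among- and within-population shares of total variance."""
    total = among + within
    return among / total, within / total


# Estimated variance components taken from Table 2.
for label, among, within in (
    ("Maturity group", 8.1, 34.1),
    ("Geographic origin", 3.6, 38.2),
):
    share_among, share_within = variance_partition(among, within)
    print(f"{label}: among = {share_among:.3f}, within = {share_within:.3f}")
```

The recomputed shares agree with the tabulated percentages (19% and 9% among groups) to within the rounding of the published variance estimates.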

The model-based clustering method was performed using all of the 96 accessions and a total of 150 SSR markers. A maximum log likelihood was attained at K = 6. At this level, individual proportions of membership in each group estimated using the multi-allele data set suggested the existence of some population structure based on three origins and MGs. The relatively small value of the alpha parameter (α = 0.032) indicates that most accessions originated from one primary ancestor, with a few admixed individuals (Ostrowski et al. 2006). Clustering bar plots with K = 2 to 6 are shown in Fig. 1. At K = 2, all 96 accessions were divided into the two subpopulations. A large portion of accessions corresponding to the mixed MGs (I–VIII, with the exception of II) belonged to one subgroup (green), while the other subgroup (red) revealed the characteristic of an earlier MG (II). All 96 accessions were separated by geographical differences in latitude at K = 2. At K = 3 and K = 4, Korean and Japanese accessions were separated. Chinese accessions were subdivided into two subgroups according to maturity group at K = 3, and four Japanese accessions formed a new subgroup at K = 4. The most divergent subgroups by origin were formed at K = 5, with one from Korea (red), two from China (pink and yellow) and two from Japan (green and blue). New subpopulations by origin were not created at K = 6. However, Chinese accessions with late MGs were separated into two subgroups (MG IV and V or later). The two Chinese subpopulations from MG IV (yellow) and MGs V or later (red) contained 79% and 92% of the HP accessions in each subpopulation. All Japanese wild soybeans were in the HP group (orange). In contrast, Chinese accessions from early MGs and 67% of the Japanese accessions were in the low protein group. However, 58% of the Korean accessions were in the HP group, indicating nearly equal distribution of HP and LP among the Korean accessions.
Fig. 1

Barplots showing genetic diversity structure for 96 soybean accessions using the program STRUCTURE. Each accession is divided into a number of hypothetical subpopulations based on the proportional membership (a vertical bar expressed as %) from K = 2 to K = 6; the most divergent subpopulations were obtained at K = 6. Each group is represented by a different color as listed: pink (A): Korean accessions; blue (B): Japanese accessions; orange (C): wild species originating from Japan; red (D): Chinese accessions with late MG; yellow (E): Chinese accessions with moderate MG (IV); green (F): Chinese accessions with early MG

Level of linkage disequilibrium among intrachromosomal SSR loci

The squared allele frequency correlations (r²) were obtained by analysis of a total of 665 intrachromosomal loci pairs using the 150 selected SSR markers. Table 3 shows the evaluation of intrachromosomal LD for the four classes subdivided by genetic distance between loci pairs, i.e., tightly linked (<1 cM), moderately linked (1–10 cM apart), loosely linked (11–20 cM) and unlinked (>20 cM), considering that LD was significant at P < 0.01 (Remington et al. 2001; Maccaferri et al. 2005). Out of the 665 assessed loci pairs, 105 had r² levels greater than 0.05 at P < 0.01 (about 15.8%). The r² values ranged from 0.002 to 0.368 for all intrachromosomal loci pairs, with an average of 0.033. The highest frequency of loci pairs in LD and the highest mean r² were found for loci pairs that mapped within <1 cM of each other. Both values decreased as the genetic distance between loci pairs increased. In addition, the majority of loci pairs in LD with r² > 0.05 at P < 0.01 (about 60% of the loci pairs in LD) were ≤20 cM apart. This reduction indicates that the probability of LD is low between distant loci pairs.
Table 3

Evaluation of LD values for the genetic distance between loci pairs in all 96 accessions

| Genetic distance between loci pairs (cM)^a | <1 | 1–10 | 11–20 | >20 | Total |
|---|---|---|---|---|---|
| No. of loci pairs in LD^b | 11 | 29 | 24 | 41 | 105 |
| Sum of loci pairs (no.) | 18 | 122 | 125 | 400 | 665 |
| Freq. of loci pairs in LD (%) | 61.1 | 23.8 | 19.2 | 10.3 | 15.8 |
| Mean r² | 0.062 | 0.045 | 0.036 | 0.027 | 0.033 |

a Tightly-linked loci, <1 cM; moderately-linked loci, 1–10 cM; loosely-linked loci, 11–20 cM; unlinked loci, >20 cM

b Loci pairs with r² > 0.05 at P < 0.01 level were used
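The tabulation in Table 3 can be sketched as a simple binning step. The code below assigns each intrachromosomal loci pair to a distance class and counts the pairs with r² > 0.05 at P < 0.01. The handling of the boundaries between classes (for example, a pair exactly 10.5 cM apart) is not specified in the text and is an assumption here, as are the function names and the toy input.

```python
CLASSES = ("<1", "1-10", "11-20", ">20")


def distance_class(distance_cm):
    """Map a genetic distance (cM) to one of the four classes of Table 3."""
    if distance_cm < 1:
        return "<1"
    if distance_cm <= 10:
        return "1-10"
    if distance_cm <= 20:  # assumed: distances between 10 and 11 cM fall here
        return "11-20"
    return ">20"


def ld_frequency_by_class(pairs, r2_min=0.05, p_max=0.01):
    """Summarize (distance_cM, r2, p_value) tuples per distance class.

    Returns {class: (pairs_in_LD, total_pairs, percent_in_LD)}.
    """
    counts = {name: [0, 0] for name in CLASSES}
    for distance_cm, r2, p_value in pairs:
        entry = counts[distance_class(distance_cm)]
        entry[1] += 1
        if r2 > r2_min and p_value < p_max:
            entry[0] += 1
    return {
        name: (in_ld, n, 100.0 * in_ld / n if n else 0.0)
        for name, (in_ld, n) in counts.items()
    }


# Hypothetical loci pairs: (distance in cM, r2, P value).
toy_pairs = [
    (0.5, 0.30, 0.001),
    (0.8, 0.01, 0.500),
    (5.0, 0.06, 0.005),
    (15.0, 0.20, 0.020),
    (40.0, 0.10, 0.0001),
]
for name, summary in ld_frequency_by_class(toy_pairs).items():
    print(name, summary)
```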

Scatter plots of the LD values based on the r² value for 96 accessions are shown in Fig. 2. Intrachromosomal LD extended to distances up to 50 cM with r² > 0.1, or up to 10 cM with r² > 0.2 for the entire set of accessions. LD would likely extend up to 50–100 cM if the r² threshold were lowered to 0.05.
Fig. 2

The pattern of LD for 150 SSR loci, showing the squared correlation of allele frequencies (r²) against genetic distance (cM) between all loci pairs

Association mapping for seed protein QTL

To test the ability of our method to detect protein content QTL, we began our investigation by focusing on the interval between Satt496 and Satt239 on LG I, which contains a major protein QTL that has been mapped in several previous studies (Diers et al. 1992; Chung et al. 2003). These two markers and three others on LG I spanned the known location of the QTL. Of the five markers surveyed, only two markers closely linked to a major protein QTL showed significant differences in allele frequencies between the LP and HP groups (Fig. 3). P values for the differences in allele frequency distributions between the two groups were 0.00063 for Satt496 and 0.00014 for Satt239. In comparison, the allele frequency distribution at Satt330, an LG I marker further (about 41 cM) from the QTL, was similar between the two groups (P = 0.41650) (Fig. 3).
Fig. 3

Allele distributions for LG I SSR markers in high and low protein groups of soybean accessions. (A) shows differences in diversity of amplicon sizes between the groups at markers nearest a known protein QTL (position indicated by red oval on diagram of LG I) that were not observed at markers further from the QTL. (B) and (C) show differences in the distribution of allele sizes in the two groups at a marker locus close to the QTL (B) compared to distribution at a marker locus further from the QTL (C)

Of the 200 SSR markers initially used for the genome-wide scan, only the 150 markers that consistently amplified DNA from all 96 accessions were used for association mapping. SSR markers reported to be linked to protein content QTL in SoyBase were used if they explained more than 10% of the total variation in the trait based on linkage analysis. Association mapping in the present study began by analyzing these SSR markers, but later on additional random markers were also included in the analysis (Fig. 4).
Fig. 4

Soybean SSR genetic linkage map showing marker positions and estimated map distances (cM; indicated to the left of the vertical bars) based on the consensus linkage map of Song et al. (2004). Twenty-two QTL (R² > 10%) for seed protein content and 13 QTL for maturity group (R² > 10%), all previously reported by linkage analysis, are indicated by blue and yellow ovals, respectively. Due to a lack of markers, some QTL were positioned outside the genetic map. Red ovals represent 11 QTL for seed protein content identified by this association analysis. Where several QTL were located within 30 cM of one another, only the QTL with the highest R² value was positioned on this soybean SSR genetic linkage map

A total of 11 independent QTL associated with protein content were identified at P < 0.0001 in the association mapping population (Table 4). In cases where several significant markers were located near one another (i.e., within 50 cM) on the same LG, the marker with the highest level of significance was considered to be the one nearest a putative QTL. Nevertheless, we recognize that P values depend in part on the allelic variability at a particular SSR locus, which is unrelated to the marker’s proximity to a QTL.
Table 4

SSR markers showing a significant difference of allele frequency between high and low protein population (P < 0.0001)

| Marker | LG | Map position (cM)^a | P-value | QTL reported by linkage analysis: Marker | Map position (cM)^a | R² (%) | References |
|---|---|---|---|---|---|---|---|
| Satt385 | A1 | 64.7 | 0.000067 | T155_1 | 93.6 | 15.0 | Orf et al. (1999) |
| Satt268 | E | 44.3 | 0.000056 | B174_1 | 30.9 | 11.1 | Brummer et al. (1997) |
| Satt564 | G | 57.3 | 0.000056 | A890_1 | 67.7 | 15.6 | Brummer et al. (1997) |
| Satt571 | I | 18.5 | 0.000000 | Satt127 | 35.3 | 65.0 | Sebolt et al. (2000) |
| Satt405 | J | 11.7 | 0.000003 | B166_1 | 27.7 | 7.6 | Lee et al. (1996) |
| Satt431 | J | 78.8 | 0.000058 | | | | |
| Satt242 | K | 14.4 | 0.000002 | R051_2 | 31.8 | 10.2 | Lee et al. (1996) |
| Satt723 | L | 1.1 | 0.000009 | A023_1 | 36.7 | 16.0 | Diers et al. (1992) |
| Satt551 | M | 95.4 | 0.000001 | | | | |
| Satt159 | N | 27.1 | 0.000022 | A071_2 | 30.3 | 11.2 | Lee et al. (1996) |
| Satt653 | O | 38.1 | 0.000017 | Satt478 | 71.1 | 6.3 | Specht et al. (2001) |

a The estimated map position (cM) was inferred from the public USDA map (Song et al. 2004)

Satt431 and Satt551, for which no linkage-analysis QTL is listed, indicate the presence of previously unreported protein QTL, with no reported protein QTL within 50 cM
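The rule of keeping only the most significant marker among significant markers lying within 50 cM of one another on the same LG can be sketched as a greedy selection. The exact procedure is not described in the text, so the greedy ordering by P value and the inclusive 50 cM cutoff are assumptions. The marker "HypoSSR1" is a hypothetical entry added to show a marker being dropped; the other two rows use the LG J values from Table 4.

```python
def select_peak_markers(markers, window_cm=50.0):
    """Keep the most significant marker per neighbourhood on each linkage group.

    `markers` is a list of (name, linkage_group, position_cM, p_value).
    Markers are visited from the smallest P value upward; a marker is dropped
    if an already retained marker on the same LG lies within `window_cm`.
    """
    kept = []
    for name, lg, pos, p_value in sorted(markers, key=lambda m: m[3]):
        clash = any(
            kept_lg == lg and abs(kept_pos - pos) <= window_cm
            for _, kept_lg, kept_pos, _ in kept
        )
        if not clash:
            kept.append((name, lg, pos, p_value))
    # Report in map order: by linkage group, then position.
    return sorted(kept, key=lambda m: (m[1], m[2]))


candidates = [
    ("Satt405", "J", 11.7, 0.000003),
    ("Satt431", "J", 78.8, 0.000058),
    ("HypoSSR1", "J", 20.0, 0.000040),  # hypothetical marker, not from the study
]
for marker in select_peak_markers(candidates):
    print(marker)
```

Satt405 and Satt431 are 67.1 cM apart, so both are retained as separate putative QTL, while the hypothetical marker 8.3 cM from Satt405 is absorbed into the Satt405 region.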

Of the 11 putative QTL, 9 were located in regions where protein QTL have been previously mapped using linkage analysis. For example, Satt564 on LG G is about 10.4 cM away from RFLP marker A890_1 (R² = 15.6%; Brummer et al. 1997), and Satt159 on LG N maps approximately 3.2 cM away from RFLP marker A071_2 (R² = 11.2%; Lee et al. 1996). These results suggest that the association analysis approach that we used in this study was effective for the detection of QTL associated with seed protein content.

Several markers with significant differences in allele frequency distribution between the LP and HP groups were located in two regions where QTL associated with protein have not been reported. Satt431 on LG J and Satt551 on LG M were not in the vicinity of known seed protein QTL (Table 4).

To investigate the possibility that some maturity QTL were misidentified as putative protein content QTL, known maturity QTL were surveyed using the Soybean Breeders Toolbox (http://soybase.org). Of 22 seed protein QTL previously detected by linkage analysis, 9 were within 30 cM of a maturity QTL. In addition, 3 of the 11 putative QTL identified by our association analysis were located near QTL for maturity. Interestingly, Satt431 (LG J) and Satt551 (LG M), from the newly identified genomic regions with putative protein content QTL, do not map close to any known maturity QTL.

Confirmation of markers for seed protein content

Two additional population sets (CP1 and CP2), each divided into groups with high or low protein contents, were used to confirm evidence for QTL detected by our analysis of the original association mapping population. Among the three significant markers identified as being associated with soybean seed protein content, Satt571, the previously identified marker chosen as a control, showed significant P-values at P < 0.05. As in the association study with the AMP, Satt551 and Satt431 were again identified as having an association with soybean seed protein content at P < 0.01 (Table 5).
Table 5

Confirmation for the selected markers detected by association analysis

| Marker | LG | P-value: AMP^a | P-value: CP1^b | P-value: CP2^b | P-value: Combined CP^d |
|---|---|---|---|---|---|
| Satt571 | I | 0.000000^c | 0.00248** | 0.00001*** | 0.00124** |
| Satt431^e | J | 0.000058^c | 0.00011*** | 0.00075*** | 0.00043*** |
| Satt551^e | M | 0.000001^c | 0.01323* | 0.00612** | 0.00967** |

a Association mapping population

b Confirmation population

c The values are obtained from Table 4

d Mean of the value estimated in two population sets

e Newly identified QTL

*, **, ***; Significant at P < 0.05, P < 0.01 and P < 0.001, respectively

Discussion

For accurate association mapping based on LD, diverse populations are required. In this study, the structure of the genetic diversity based on origin (Korea, China, and Japan) and maturity group was tested by AMOVA. Only a relatively small portion (9%) of the molecular variation was explained by the geographical origin of the accessions. However, about 19% of the molecular variation was accounted for by maturity group of representative accessions (Table 2). The AMOVA indicated that the accessions are highly structured in this study. In fact, although the AMOVA evidenced significant differences among accessions grouped on the basis of their maturity group and origin, a high degree of variability, 81% for maturity group and 91% for origin, was also detected within each group.

Model-based clustering analysis of the 96 accessions in the AMP revealed complex genetic relationships among the entire set of accessions. The Chinese subpopulation was divided into two different groups according to maturity. Also, four Japanese accessions were separated from the main subpopulation because they were wild species with a small seed size. The 96 accessions used for association analysis were split into six distinct subpopulations through comparison of their origin and other agronomic traits at K = 6. Thus, three main subpopulations were roughly detected in our population based on three distinct origins, and three more subdivisions were added, suggesting the existence of population structure (Fig. 1).

The fundamental idea of a population-based method is to separate accessions obtained from a mixed population into several unstructured subpopulations and to determine the association between marker alleles and phenotypes in the homogeneous subpopulations (Pritchard et al. 2000b; Gupta et al. 2005). In addition, spurious associations are not considered likely when the accessions related to the particular phenotypes are not biased towards specific subpopulations, even if population structure is present (Pritchard et al. 2000a; Cardon and Palmer 2003; Malysheva-Otto et al. 2006; Ostrowski et al. 2006). The six subpopulations identified by our analysis of population structure indicated distinct subdivisions on the basis of origin and maturity group. Also, accessions associated with high protein content remained in most subpopulations without biased distribution towards particular subpopulations. Therefore, the population used in this study was thought to be applicable to association analysis, even if some population structure is present.

A clear relationship was observed between the degree of linkage between loci pairs and the level of LD. Moreover, the effect of recombination on the LD level was inferred indirectly (Table 3). In our study, the intrachromosomal LD extended up to 50 cM with r² > 0.1, or 10 cM with r² > 0.2 at P < 0.01 for all 96 accessions (Fig. 2). Extensive LD has also been reported in other selfing species. Malysheva-Otto et al. (2006) reported that intrachromosomal LD extended up to 50 cM with r² > 0.05, or up to 10 cM with r² > 0.2 in 953 barley accessions, and LD persists over 4 cM in sorghum (Deu and Glaszmann 2004). Interestingly, intrachromosomal LD extended 50–100 cM with r² > 0.05 in all 96 accessions. Although this level of LD persistence is considered to be high, long-distance LD of up to 50–100 cM with r² > 0.2 has also been reported in several isolated local populations of Arabidopsis accessions (Nordborg et al. 2002). Additionally, long-distance LD of up to 100 cM with r² > 0.1 was detected in a population of European two-row spring barley (Kraakman et al. 2004). However, it was proposed recently that the cutoff level for useful levels of LD in plants should be limited to r² > 0.1 (Malysheva-Otto et al. 2006).

The number of markers required to cover the genome in an association study is determined by the extent of LD (Flint-Garcia et al. 2003; Malysheva-Otto et al. 2006). Therefore, 150–300 markers should be adequate to conduct preliminary whole genome association studies in soybean (about 3,000 cM). This is far fewer than would be required for other species or populations with less LD.
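The arithmetic behind the 150–300 figure is not spelled out in the text. One reading consistent with the numbers quoted is one marker per stretch of useful LD of roughly 10–20 cM across a map of about 3,000 cM; the sketch below makes that assumption explicit. The function name and the 10–20 cM spacing are ours, not the authors'.

```python
import math


def markers_needed(genome_cm, ld_extent_cm):
    """Number of evenly spaced markers if each one tags `ld_extent_cm` of map."""
    return math.ceil(genome_cm / ld_extent_cm)


GENOME_CM = 3000  # approximate soybean map length quoted in the text
for extent in (20, 10):  # assumed extent of useful LD, in cM
    print(f"LD extent {extent} cM -> {markers_needed(GENOME_CM, extent)} markers")
```

A species in which LD decays within a fraction of a centimorgan would need proportionally more markers under the same reasoning.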

In our study, two groups of soybean accessions with either high or low seed protein content were used for the association analysis. Analyses to test for population structure were prompted by concerns about unbalanced representation of maturity groups in the high and low protein subpopulations. A survey of the more than 15,000 accessions in the USDA germplasm collection was conducted to select two groups of 48 accessions from a larger group of 300 accessions with either high or low protein content. In order to reduce or eliminate the potential effects of population structure in biasing the statistical analyses, an effort was made to include accessions from various geographical locations and MGs in each of the two protein groups. Selection of 48 individuals to represent diverse geographical origins (Korea, China and Japan) in each group was easily accomplished, but it was difficult to balance representative MGs between the high and low protein groups with the pool limited to 300 accessions. However, even when it was expanded to 500 accessions, representation of the various MGs in each protein group remained unbalanced, thus contributing to population structure. We attempted to address this limitation by retesting significant markers after genotyping accessions in high and low protein groups from two confirmation populations that were independent of each other and the original AMP.

Seed protein content QTL were detected by testing for significant differences in allele frequencies between the low and high protein groups (Table 4). Because G. soja shares common alleles with G. max at the seed protein QTL, G. soja allele data were included in the association analysis. A similar association analysis using SSR markers was conducted to identify genes associated with multiple sclerosis (MS) in humans (Goedde et al. 2002). In that study, four markers in the HLA major histocompatibility complex region associated with MS showed a significant difference in allele frequencies between MS cases and controls.
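
A comparison of allele frequency distributions between two groups can be sketched as a test on an allele-by-group contingency table. This section does not specify the test statistic, so a Pearson chi-square is assumed here. The function name and the allele counts are hypothetical, with one allele counted per inbred accession.

```python
def chi_square_allele_test(high_counts, low_counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k
    table of allele counts (rows: protein groups, columns: alleles)."""
    # Keep only alleles observed at least once in the pooled sample.
    alleles = sorted(a for a in set(high_counts) | set(low_counts)
                     if high_counts.get(a, 0) + low_counts.get(a, 0) > 0)
    table = [[high_counts.get(a, 0) for a in alleles],
             [low_counts.get(a, 0) for a in alleles]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals:
        raise ValueError("each group needs at least one observed allele")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(alleles) - 1)
    return stat, dof


# Hypothetical counts for one SSR marker, 48 accessions per group.
high = {"A": 30, "B": 10, "C": 8}
low = {"A": 12, "B": 28, "C": 8}
stat, dof = chi_square_allele_test(high, low)
```

For these toy counts the statistic exceeds 9.21, the chi-square critical value for 2 degrees of freedom at P = 0.01, so the marker would be declared significant under the assumed test.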

Case–control studies have been widely used to examine genetic risk factors for complex diseases in human genetics. The most important issue in case–control studies is the selection of two well-defined groups representing patients and unaffected controls (Lewis 2002; Ma et al. 2006). Our association study used groups of soybean accessions with either high or low seed protein content, rather than groups with high and normal protein content. In other words, this study was initially designed as a case–case study to detect genes with both positive and negative effects on protein content in populations selected for extreme phenotypes. The benefit of this methodology is that the predominant alleles associated with high or low seed protein content can be compared simultaneously and evaluated for statistical differences in allele frequencies. Case–case and case–case–control studies have sometimes been performed for human diseases (Ma et al. 2006; Potoski et al. 2006; Robert et al. 2006). These studies investigated the risk factors associated with disease in two well-defined patient groups.

When association studies are performed using multiple-allele markers like SSRs, one concern is how to treat rare alleles (Lewis 2002). Inclusion or exclusion of data for rare alleles in our present association analysis had little effect on the level of significance of the markers.
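
One way to check sensitivity to rare alleles is to run the same test twice, once on all alleles and once after excluding those below a frequency threshold. The sketch below shows only the exclusion step. The 5% threshold, the function name and the counts are assumptions for illustration and are not values from the study.

```python
def drop_rare_alleles(high_counts, low_counts, min_freq=0.05):
    """Remove alleles whose pooled frequency across both groups is below min_freq."""
    total = sum(high_counts.values()) + sum(low_counts.values())
    keep = {
        allele
        for allele in set(high_counts) | set(low_counts)
        if (high_counts.get(allele, 0) + low_counts.get(allele, 0)) / total >= min_freq
    }
    filtered_high = {a: c for a, c in high_counts.items() if a in keep}
    filtered_low = {a: c for a, c in low_counts.items() if a in keep}
    return filtered_high, filtered_low


# Hypothetical counts: alleles C and D are each seen in fewer than 5% of accessions.
high = {"A": 30, "B": 16, "C": 2}
low = {"A": 14, "B": 33, "D": 1}
filtered_high, filtered_low = drop_rare_alleles(high, low)
```

If the significance of a marker changes little between the full and filtered tables, as reported above, the result does not depend on how rare alleles are handled.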

In conclusion, analysis of population structure using a model-based clustering method showed that genetic diversity existed in our plant materials, although population structure was present due to maturity group and origin. Also, the long-range LD estimated in this study demonstrates the potential for genome-wide association mapping with fewer markers in soybean. After maturity QTL listed at the Soybean Breeders Toolbox were positioned on our soybean SSR genetic linkage map (Fig. 4), 9 of 22 seed protein QTL were found to be near or very close to QTL for maturity. Most of the maturity QTL identified by linkage analysis seem to overlap with QTL for protein content, indicating a biological correlation between maturity and seed protein content. This could therefore affect the ability to identify QTL for seed protein content in the association analysis. However, of the 11 SSR markers that differed significantly between the high and low protein groups in this association analysis, only three mapped close to a known maturity QTL. The other eight markers, including the two from genomic regions in which protein QTL had not been previously identified, could be linked to seed protein content QTL rather than to maturity QTL that influence seed protein content. Thus, Satt431 on LG J and Satt551 on LG M could be linked to novel QTL for seed protein, although a possible bias resulting from a degree of population structure cannot be ignored. The association analysis approach that we used successfully identified a number of SSR markers linked to previously reported QTL associated with soybean seed protein content, as well as two markers for previously unreported seed protein QTL. These QTL were also confirmed in new, independent population sets. Further studies, perhaps using a linkage mapping approach, are needed to confirm whether Satt431 on LG J and Satt551 on LG M are truly linked to previously undetected QTL for seed protein content.
Although we cannot ignore the limited number of markers used in this study and the existence of population structure, association studies of this kind could provide valuable information for identifying the possible locations of additional QTL in soybean.

Acknowledgements

This research was supported in part by a grant (code no. CG3121) from the Crop Functional Genomics Center of the 21st Century Frontier Research Program, funded by the Ministry of Science and Technology (MOST) of the Republic of Korea. We also thank the National Instrumentation Center for Environmental Management at Seoul National University in Korea. We thank Dr. H. Roger Boerma (University of Georgia, USA) for his critical comments on this manuscript.

Copyright information

© Springer Science+Business Media B.V. 2007