Extensive genotyping and phenotyping can help harness favorable alleles underlying phenotypic variation in both wild and domesticated germplasm. Soybean [Glycine max (L.) Merr.] is a major crop for dietary protein and oil worldwide. Several hundred soybean genomes have been resequenced (Chung et al. 2014; Lam et al. 2010; Valliyodan et al. 2016; Zhou et al. 2015), and three genome-wide high-density SNP arrays have been developed and used to genotype thousands of soybean accessions (Lee et al. 2015; Song et al. 2013; Wang et al. 2016). These data have been primarily used to compare the patterns of genetic variation between G. max and its wild progenitor (G. soja Siebold & Zucc.) to understand the history of soybean domestication and identify selective sweeps related to the domestication and improvement of soybeans. The data have also been used to identify loci controlling important agronomic traits, such as protein-and-oil and seed-weight traits (Bandillo et al. 2015; Hwang et al. 2014; Zhou et al. 2015). However, further efforts will be required to implement genome-wide association studies (GWAS) (McCarthy et al. 2008) with higher statistical power and mapping resolution in soybean.

Soybean was domesticated ~ 5000 years ago from G. soja, its sympatric wild annual progenitor that is distributed throughout East Asia, including most of China, Korea, Japan, and part of Russia (Hymowitz 2004; Larson et al. 2014). Different regions of China have been proposed as a single center of soybean domestication on the basis of morphological, cytogenetic, and seed protein variation (Broich and Palmer 1981; Hymowitz 2004; Hymowitz and Kaizuma 1981). Multiple centers of domestication including the southern areas of Japan and China have also been proposed based on chloroplast sequence variation (Xu et al. 2002) and archeological records (Lee et al. 2011). However, recent phylogenetic studies using whole-genome resequencing data clearly indicated a monophyletic nature of domesticated soybean (Chung et al. 2014; Lam et al. 2010; Zhou et al. 2015). Of late, molecular studies that used hundreds of markers and accessions have proposed different areas, such as the Yellow River of China (Li et al. 2010) and southern China (Guo et al. 2010) as a center of soybean domestication. Yet, two other studies suggested the domestication center as northern and central China using high-density SNP array data (Wang et al. 2016), or central China surrounding the Yellow River using specific-locus amplified fragment sequencing data (Han et al. 2016).

In most of the previous studies, accessions collected from the Korean Peninsula were underrepresented, although this region is a central region of wild soybean distribution. For example, in the recent genome-wide analyses reported by Wang et al. (2016) and Han et al. (2016), accessions collected from China accounted for 91.8% and 100% of the total samples, respectively. Here, we present an analysis of SNP genotype data from 2824 domesticated, 1360 wild, and 50 putative hybrid accessions as part of an effort to characterize the entire Korean indigenous soybean collection deposited in the country’s National Agrobiodiversity Center. We genotyped soybean accessions using the 180 K Axiom® SoyaSNP array that was developed from soybean genome resequencing data (Lee et al. 2015). Our high-density SNP array data allowed us to evaluate levels of genetic diversity and patterns of population structure. We further attempted to detect genetic loci underlying soybean flower color, seed coat, and domestication traits, as well as provide a refined model of the evolutionary history of domesticated soybean.

Materials and methods

Plant materials and SNP genotyping

The majority of the accessions were from the National Agrobiodiversity Center in Jeonju, Korea, with a small number of accessions provided by individual laboratories (Table S1). The National Agrobiodiversity Center collection consists of approximately 14,000 accessions of improved and landrace cultivars (G. max) and approximately 3700 accessions of wild soybean (G. soja). The Korean germplasm collection substantially overlaps with those of other countries, particularly the United States. Most accessions collected from locations in other countries than Korea have been donated from the US National Genetic Resources Program. Notable exceptions were the 46 wild accessions from Japan, whose accession codes start with ‘B’. These were donated from National BioResource Project in Japan. As our primary goal was to characterize the indigenous soybean collection of Korea, we attempted as much as possible to genotype Korean accessions with even distribution across South Korea. At the same time, we analyzed representative sets of landrace accessions from China, North Korea, and Japan and approximately 400 improved lines, most of which are immediate descendants of ancestral lines of United States soybean cultivars (Gizlice et al. 1994), so that cultivated soybean from Korea could be assessed in the context of worldwide soybean germplasm pool (Table S2). Representative G. soja accessions from China, Russia, and Japan were also selected, allowing the even geographic distribution of wild soybean in each of these countries to be sampled. Initially, we planted approximately 5000 domesticated accessions and 2400 wild soybean accessions, of which approximately 90% were collected in Korea. In results, approximately 35% of the total number of domesticated accessions collected from each of provinces in South Korea were planted, and of the total number of wild accessions, approximately 65% that evenly represented across South Korea were planted. After pure line selection by single seed descent was performed at least two times, DNA samples from approximately 4400 diverse soybean germplasm lines were genotyped. However, because our SNP array data set ended up with smaller number of G. soja accessions from China than those from Korea and Japan, soybean population structure from the representative set was additionally assessed using genome resequencing data (downloaded from Figshare database, from 45 G. soja accessions reported by Zhou et al. (2015).

DNA samples from the ~ 4400 diverse soybean accessions were extracted from a single plant of each accession and were genotyped with the Axiom® SoyaSNP array containing 180,961 SNP sites (Lee et al. 2015). Of the lines genotyped, 4234 with > 97% sample call rate were selected for further analysis. SNPs were scored following the Axiom® Genotyping Solution Data Analysis User Guide ( as described by Lee et al. (2015). Of the 180,961 SNPs, 170,223 were selected on the basis of the development and validation study. Missing data points in the 170,223 SNPs were imputed using BEAGLE 4.0 with default settings (Browning and Browning 2007). The 170,223 SNPs were then used to screen out duplicated and redundant accessions, leaving 3036 non-redundant accessions. After the initial filtration, SNPs with heterozygous rate > 0.02 and minor allele frequency < 0.02 were discarded from the genotype data of the non-redundant accessions, leaving a total of 117,095 high-quality SNPs for the further population analyses (Fig. S1).

Phenotypic data used for GWAS were obtained primarily from field evaluations in the field at National Institute of Crop Science, Suwon, Korea, in 2012 and 2013 (Table S1). The observed phenotype data were converted into binary data. The flower color phenotypes were divided into absence of color (white) or presence of colors ranging from light to dark purple. The seed-coat color phenotypes were divided into absence of colors (yellow or green) or presence of colors ranging from brown to black. Domestication phenotypes were divided into presence (G. max) or absence (G. soja) of domestication.

Population structure and genetic diversity pattern analyses

Principal component analysis (PCA) was conducted to summarize the genetic structure and variation present in the soybean collection using smartpca function in Eigensoft v7.2 (Patterson et al. 2006; Price et al. 2006). We plotted the first three PCs. Neighbor-joining trees were constructed by MEGA7 (Kumar et al. 2016) under the p-distances model. We used a model-based clustering method implemented in ADMIXTURE v1.23 (Alexander et al. 2009) to investigate the population structure of the soybean accessions. We determined the optimal K, the number of clusters based on the smallest cross-validation error calculated from v-fold cross-validation procedure. We plotted the membership coefficient using DISTRUCT (Rosenberg 2004). To investigate the level of genetic diversity maintained in soybean accessions, we calculated the nucleotide diversity (π) using VCFtools v 0.1.13 (Danecek et al. 2011). Genetic differentiation (Weir and Cockerham’s FST) between G. max and each of the G. soja subpopulations was calculated using the VCFtools V0.1.13 (Weir and Cockerham 1984). Hierarchical analyses of molecular variance (AMOVA) in the whole soybean set and the representative soybean set were performed using ARLEQUIN v. (Excoffier and Lischer 2010). The significance of the values for FCT (difference among groups), FSC (difference among populations within groups), and FST (difference among populations) was tested by 1023 permutations.

Genome-wide association studies

We conducted GWAS using PLINK 1.9 (Purcell et al. 2007). Flower color, seed-coat color, and domestication phenotype data were converted to binary data to perform a conditional logistic regression model analysis. Conditional specific SNPs were selected on the basis of minor allele frequencies of SNPs among the groups defined based on PCA analysis. A Bonferroni correction was used to control for the multiple testing problem by adjusting the alpha value from α = 0.01 to α = (0.01/117,095 SNPs) where 117,095 is the number of statistical tests conducted. Therefore, statistical significance of a SNP-trait association was set at 8.54e−8 (−log10P = 7.06). Manhattan plots were produced using the qqman package (Turner 2014). To define linkage disequilibrium (LD) patterns, correlation coefficient of alleles (r2) was calculated for SNPs under the peak regions that exhibited significant association using Haploview 4.2 (Barrett et al. 2005). The confidence interval (CI) method of Gabriel et al. (2002) was used to identify LD blocks.


Overall structure of the genotyped soybean germplasm population

Of the approximately 4400 accessions genotyped, 4234 exhibited > 97% sample call rate. These were used as the total population for characterizing the Korean soybean germplasm (Table S1). The majority of this 4234 set contained accessions from Korea (78.7% G. max and 91.5% G. soja) (Fig. 1a). The rest were G. max landrace and G. soja accessions from China, Russia, North Korea, and Japan and improved lines mostly developed in the USA (Table S2). To eliminate potential confounding effects exerted by hybrids in the comparison of wild and domesticated soybean populations (Vaughan et al. 2008; Wang et al. 2017), we first removed 50 putative hybrid accessions from among the 4234 accessions (Fig. 1b; Table S2). In the field evaluation, most of these 50 accessions showed intermediate morphologies between domesticated soybean (G. max) and its wild relative (G. soja). Furthermore, principal component analysis (PCA) using 170,223 high-quality SNPs showed that the accessions were positioned between two large groups of G. max and G. soja (Fig. 1a). In further support of their suspected hybrid status, 50 accessions showed mixed wild or domesticated genome fractions ranging from 30 to 70% in K = 2 or 3 populations in the ADMIXTURE analysis (Fig. 1b).

Fig. 1
figure 1

Population structure of the genotyped 4234 soybean accessions. a Principal components of SNP variation. PC1 and PC2 indicate score of principal components 1 and 2, respectively. Each of PC1 and PC2 explained 15.6% and 2.7% of variance in the data. Glycine max, G. soja, and hybrids are shown by red, blue, and green dots, respectively. The majority of Korean accessions cluster together within dashed eclipses. b ADMIXTURE plots. The accessions were divided into three groups: G. max, G. soja, and their hybrids. c Distribution of number of redundant accession groups that showed < 1.25% inconsistencies between the SNP calls. G. max and G. soja are shown by white and gray boxes. d Geographic distribution of the collection sites for G. soja accessions

In our previous development and validation study (Lee et al. 2015), the SNP calls genotyped by the SoyaSNP array were highly reproducible, with inconsistencies of ≤ 1.17% observed within pairs of 27 duplicated samples after excluding missing genotypes in either sample. Several sets of near-isogenic isolines were genotyped (Table S3). Single-gene isolines (backcross-derived isolines for single genes) showed approximately 1.0% inconsistencies after excluding missing genotypes in either sample (e.g., 1.16% between Harosoy and L67-153 [Harosoy(6) × Higan]). As expected, a slightly higher level of inconsistency (up to 1.5%) was observed for samples from multiple-gene isolines (e.g., 1.48% between L62-667 [Harosoy(6) × T204] and OT94-51 [OT89-5/L71-802//OT89-6]). However, we occasionally observed that some soybean accessions that had no known pedigree relationship showed < 1.50% inconsistencies (e.g., 0.87% between Williams 82 K and KLS85102). Therefore, we used a 1.25% inconsistency value as the cutoff to remove redundant or highly similar accessions from groups of duplicates or near-isogenic lines. The same cutoff value was applied to filtration of wild soybean accessions. In each of the genotype duplicate sets, an accession with a sample call rate ≥ 99% was preferentially retained. For each of the near-isogenic line groups, the recurrent parent or representative single-gene isoline in case of the absence of parents was retained. Of the 4184 accessions genotyped in this study, 1148 (867 domesticated and 281 wild) were removed (Fig. 1c). The high rate of redundant and highly similar accessions has been frequently reported in worldwide germplasm collections (Food and Agriculture Organization of the United Nations 2010; McCouch et al. 2012). For domesticated soybeans, the major cause in the National Agrobiodiversity Center in Korea is probably the unknowing submission of the same accession with different collection sites and designators because there are many accessions with the same common name but with different collection sites or collectors. For wild soybeans, multiple accessions were collected from a narrow habitat area.

After the filtration, a final set of 3036 genotyped accessions was available for population structure analysis (Table S1 and S3). Only a few accessions were removed from countries other than South Korea. As a result, overall proportion of soybean accessions among countries in this 3036 set was similar to that of the 4234 set (Table S2). In the 3036 set, 1957 were G. max accessions and 1079 were G. soja accessions. Representative G. soja accessions from China, Russia, and Japan remained to evenly reflect the geographic distribution of native G. soja in each of these country regions (Fig. 1d).

Population structure

ADMIXTURE (Alexander et al. 2009) and PCA (Patterson et al. 2006) were used to infer population structure of the 3036 non-redundant soybean set using 117,095 SNPs with heterozygous rate < 0.02 and minor allele frequency > 0.02 (Fig. S1). As observed in the analysis of our total population of 4234 accessions, the 3036 accessions were clearly divided into two large groups, representing G. max and G. soja (Fig. S2). Both the estimated cross-validation (CV) error plot from ADMIXTURE and scree plot from the PCA supported the presence of two large groups (Fig. S2), although the slopes did not level off, which is likely because of subgroupings within the two large groups. The clear separation of G. max and G. soja groups might be expected by the ascertainment bias, which favored selection of G. max SNPs (Lee et al. 2015), and the sampling bias. Unlike the previous observations that the genome diversity level was > twofold lower in the domesticated soybeans relative to that in the wild soybeans, the diversity level of the domesticated soybeans (mean per-site nucleotide diversity (π = 0.189) estimated from the 117,095 SNPs was ~ 1.58-fold lower than that of the wild soybeans (π = 0.298) and nearly two times more domesticated soybean accessions were used for the population structure analyses. The 3036 set also contained excessive numbers of accessions collected in South Korea in both the domesticated and wild soybean groups. Interestingly, in the PCA space constructed with the first two PCs, G. soja accessions from Japan and China that were located at both ends of the G. soja cluster were almost equally close to the G. max cluster. However, in the PC1 and PC3 plot, accessions from Japan were closest to the G. max cluster and accessions from China were the most distantly related to the G. max cluster (Fig. S2).

When we analyzed G. max and G. soja separately, a somewhat distinct subpopulation structure was revealed. Within each of G. max and G. soja populations, CV errors of the ADMIXTURE runs decreased gradually without a steep drop (Fig. S3), whereas eigenvalues of the PCA runs showed steep decrease up to K = 5 (Fig. 2), indicating that there were at least four distinct subpopulations in each of the G. max and G. soja populations. Both the ADMIXTURE and PCA plots from the 1957 domesticated soybeans did not show distinctive grouping (Figs. 2 and S3). The majority of South Korean domesticated accessions (~ 80%) formed a dense subpopulation likely because of recent overcollection. This notion is supported by that the rest of the Korea accessions were well mixed with Chinese, North Korean, and Japanese accessions, which did not show distinct subgrouping on a geographic basis. Notably, the North Korea accessions, which could be considered true landraces because of their collections during the first half of the twentieth century before modern breeding research, were evenly distributed across subpopulations. The improved cultivars were narrowly clustered in the PCA plot, indicating much lower diversity relative to that of the entire domesticated soybeans. The 1079 wild soybean population showed distinctive subpopulations (Fig. 2 and S3), as shown in the analyses of the entire 4234 population. The groupings were consistent with geographic distributions of the collection sites. Korean accessions and Japanese accessions formed a unique subpopulation, respectively. Chinese formed two subpopulations. Five accessions from the Russian border clustered together with those from northeast China. A strong relationship between subpopulations and their geographic distribution was notably exemplified by accessions from Jeju Island located 130 km off the southern coast of the Korean Peninsula; although they belong to the Korean subpopulation, all accessions from Jeju Island formed a subgroup that was the closest to the Japanese subpopulation (Fig. 2c).

Fig. 2
figure 2

Population structures of 1957 domesticated and 1079 wild soybean accessions in the 3036 non-redundant soybean accession set. a Principal components (PC) of SNP variation in the domesticated population. The plots show the first three principal components. The countries of collection or improvement status of the soybean accessions in a and c are represented by two-letter codes—CN, China; IP, improved breeding line; JP, Japan; ND, not determined; NK, North Korea; RS, Russia; and SK (KR), South Korea. b Scree plot of the PC number and their contribution to variance from principal component analysis of the domesticated accessions. c Principal components of SNP variation in the wild population. The plots show the first three principal components. A cluster of accessions from Jeju Island is indicated by a dashed eclipse. d Scree plot of the PC number and their contribution to variance from principal component analysis of the wild accessions

Detection of SNPs associated with domestication history

Domestication is a process of continuous artificial selection of a group of traits, collectively called domestication syndrome. The domestication process has produced selective sweeps with significant reductions in nucleotide diversity (Chung et al. 2014; Doebley et al. 2006; Hufford et al. 2012) on limited regions of the genome (approximately 5–10% of the genome). Numerous recent whole-genome resequencing studies have effectively detected the selective sweeps, which are associated with domesticated genes (Meyer and Purugganan 2013) by examining reduction of diversity (ROD) in windows along chromosomes. However, our SoyaSNP array data are not dense enough to detect the reduction of diversity. Thus, we attempted to detect SNPs associated with domestication using a case–control GWAS method that analyzed binary domestication phenotypes, which were determined by presence (G. max) or absence (G. soja) of domestication.

To test whether our case–control GWAS method enabled to find genes or chromosomal regions underlining binary phenotypes in our 3036 non-redundant population, we chose two highly studied phenotypes—flower and seed-coat colors—which are monogenic and multigenic, respectively. Because our population was highly structured, we performed logistic regression model analysis conditional on a list of subpopulation-specific SNPs. The selected specific SNPs included one perfect domestication-specific SNP with each allele being perfectly correlated with G. max or G. soja membership of soybean accessions, and ten subpopulation-specific SNP within each of the G. max and G. soja populations (Table S4). The flower color phenotypes were divided into absence or presence of anthocyanin deposition colors. The seed-coat color phenotypes were divided into absence or presence of anthocyanin deposition colors. Using the conditional logistic regression model, we detected a broad and strong peak for flower color with the most significant SNP (max − log10P = 80.6) located at 17,877,234 on chromosome 13 (Figs. 3a, S4, and S5c). This peak area contained the W1 locus, which is the major locus determining flower color (Zabala and Vodkin 2007). However, the most significant SNPs were located ~ 500 kb off the position of the flavonoid 35-hydroxylase gene, which is the causal gene of the W1 locus. To understand this region, we estimated pairwise LD for SNPs from 16.8 to 18.8 Mb. A strong LD pattern was observed between all the SNPs under the most significant SNPs; however, no clear LD pattern was observed near the flavonoid 35-hydroxylase gene. Although the result might be due to a SNP density insufficient for a long LD block frequently observed in soybean whose LD decayed to half of its maximum at > 100 kb with genome resequencing data (Chung et al. 2014; Lam et al. 2010; Valliyodan et al. 2016), it has been often observed that a causal gene for a strong peak are not always corresponding with the highest −log10P value (Segura et al. 2012; Yano et al. 2016).

Fig. 3
figure 3

Genome-wide association scans for 3036 soybean accessions for flower color, seed-coat color, and domestication. a Manhattan plot for flower color. The solid horizontal line denotes the Bonferroni-adjusted significance threshold. Chromosomal regions of known genes (T, I, R, W1) or loci (PD05 and qSW) are indicated by dashed vertical lines. b Manhattan plot for seed-coat color. c Manhattan plot for domestication. d Local Manhattan plot (top) and LD heatmap (bottom) surrounding the T locus on chromosome 6. Dashed lines indicate the region of the T locus. Physical locations (kb) are indicated under the Manhattan plot. e Local Manhattan plot (top) and LD heatmap (bottom) surrounding the qSW locus on chromosome 17. A bar indicates the region of the qSW locus

We detected more than 30 peaks exceeding a significant threshold (-log10P ≥ 7.07) for seed-coat color (Figs. 3b, S4, and S5). The top three peaks were correlated with three known major loci (I, R, and T on chromosomes 8, 9, and 6, respectively) that control the deposition of various anthocyanin pigments in seed coat (Yang et al. 2010). The highest peak was generated from a chromosomal region surrounding the inverted CHS gene repeats, which is the causal region of the I locus (Clough et al. 2004). A strong LD pattern was observed at the I locus region. An SNP AX-90432942 on chromosome 6 with the second highest − log10P value = 69.8 was generated from flavonoid 3′ hydroxylase, which is the causal gene of the T locus. Finally, the SNPs near the R2R3 MYB transcription factor gene, a strong candidate gene for the R locus reported by Gillman et al. (2011), were not significantly associated with seed-coat colors. However, the gene is one of the R2R3 MYB transcription factor genes tandemly repeated in this chromosomal region and numerous highly significant SNPs were observed 100-kb away from the proposed R gene. Interestingly, the peak on chromosome 13 is located on the W1 locus that influences seed-coat color in a case of the homozygous recessive it genotypes (Palmer et al. 2004). Considering such a wide range of soybean seed-coat color variations, the detection of numerous minor peaks is not surprising, as reported by Song et al. (2016), although it is still surprising to detect this large number of significant peaks using binary phenotyping data. Nevertheless, we think that some of those minor peaks are inevitably false because the limited number of conditional SNPs could not correct for all inflation of the statistic caused by population substructure.

Since our logistic regression model could readily detect loci associated with flower and seed-coat colors using binary phenotypes, we performed GWAS for domestication syndrome using the binary domestication phenotypes. For this analysis, we excluded the perfect domestication-specific SNP in the list of subpopulation-specific SNPs (Table S4). We detected numerous peaks for domestication syndrome over the genome, as expected from the previous studies (Chung et al. 2014; Hufford et al. 2012; Meyer and Purugganan 2013) that showed that domestication features covered approximately ~ 7% of the crop genome. To examine whether previously detected domestication regions were also detected in the current study, we compared peak locations from our study with selective signals previously detected for two domestication traits, pod dehiscence and seed weight (Figs. 3 and S5d), which are two of the most critical domestication traits among the traits assayed by Zhou et al. (2015). The 190-kb region (PD05) responsible for pod dehiscence was also detected in our study with three highly significant SNPs (-log10P ≥ 20), and the qSW locus for seed weight was detected with > 30 highly significant SNPs (-log10P ≥ 20). Lengths of strong LD blocks under the peaks corresponding to the PD05 and qSW loci were not sufficient to define the selective sweeps that have been known to be > 100 kb, as observed in our LD analyses for the flower and seed-coat color loci. Flower and seed-coat colors analyzed in this study are considered domestication- or diversification-related morphological features because nearly all wild soybean accessions have purple flowers and black seed coats. As expected, their major loci, W1, I, R, and T were also detected with highly significant SNPs (-log10P ≥ 20) in this GWAS for domestication syndrome (Fig. 3).

Extraction of a representative set

Although we used excessive number of Korean wild accessions in the above analysis, Korean accessions formed a unique group from those of neighbor countries. However, since a fundamental assumption of model-based methods, such as ADMIXTRUE and PCA, is that the sample available for analysis is representative of the entire population distribution, sample sizes of subpopulations can substantially affect population stratification and ancestral population inference (McVean 2009; Shringarpure and Xing 2014). To investigate the possibility that excessive numbers of domesticated or Korean soybean accessions might have caused bias in inference of population structure of wild and domesticated soybeans, we obtained representative domesticated and wild soybean sets by selecting one from each of tightly distributed soybean miniclusters in the PCA plots, with a caution that overall distribution patterns are maintained (Fig. S6). For the representative set of wild soybeans, we filtered the population of the tightly distributed Korean accessions and selected 50 diverse Korean wild accessions (Table S2). In addition, four wild soybean anomalies misplaced to subpopulations different from subpopulations predicted by their collection site records were excluded: three Korean and one Chinese accessions (Fig. S7). For the representative set of domesticated soybeans, we selected 50 diverse G. max accessions that represent diversity of 1957 G. max accessions. The results of the AMOVA indicated that the overall genetic structure observed in the 3036 non-redundant soybean population was well represented by the extracted representative set with some decrease in genetic diversity in the G. max population (Table S5). PCA plots from the resultant representative set of 194 soybean accessions showed distribution patterns similar to those from the 3036 non-redundant soybean accessions, although relative sizes of G. max and G. soja distributions in the PCA spaces constructed with the first and third PCs were reversed. The last drop of eigenvalues from the PCA runs occurred between K = 5 and K = 6 (Fig. S8), indicating that there were five distinct subpopulations.

Center of soybean domestication

Because G. soja can be found in situ across most of the East Asia, it is important to establish the population structure, if any, of a diverse collection of G. soja accessions and to associate one or more of these populations with a collection of domesticated G. max varieties. To perform these experiments, we analyzed the population structure of the representative set of 194 soybean accessions with ADMIXTURE and found K = 5 populations based on the estimated CV error plot (Fig. S8). Thus, the soybean accessions were partitioned into one G. max (Gm) and four G. soja (Gs-I, Gs-II, Gs-III, and Gs-IV) subgroups (Fig. 4a–c). Wild accessions from China were divided into two subgroups, Gs-I and Gs-II (Fig. 4d). The Gs-I group showed the least diversity (Table 1), and most of them distributed in the middle region of the Yellow River basin. The Gs-II group was dispersed across northeast China, south China, and the Russian border of northeast China. Interestingly, this grouping result is remarkably similar to that obtained by a recent comprehensive study that showed that, by analyzing a total of 712 G. soja individuals from 40 natural populations in China, Chinese wild soybeans were grouped into two main subgroups, which were one from the Yellow River basin and the other from northeast China and south China (Guo et al. 2012). The Gs-IV distributed in Japan. The Gs-III showed the greatest diversity and distributed in South Korea and most of them appeared to be ancient admixture between Gs-II and Gs-IV. Interestingly, despite clear separation of the Chinese G. soja, diversity level of the combined population of Gs-I and Gs-II was similar to that of Korean or Japanese G. soja. An independent Gm group appeared from K = 2 to K = 5 (Fig. 4c). Interestingly, major genomic fractions of the Gm subgroup consistently appeared as minor genomic fractions of the Gs-IV and minor genomic fractions of the Gm group were a genomic fraction of Gs-I ancestry. The results suggested that after domestication of the Gm subgroup from the Gs-IV subgroup, the Gm subgroup was substantially diversified by introgression of the Gs-I genomic fractions. One of the Gs-I accessions, PI 549046, appeared to be a G. max × G. soja hybrid, although its genomic fraction (~ 22%) from G. max was lower than our hybrid filtration criteria (30% domesticated ancestry), which was less stringent than 20% domesticated ancestry used in other admixture studies (e.g., Wang et al. (2017)).

Fig. 4
figure 4

Identification of the domestication center of G. max. a, b Principal components plots of SNP variation. PC1, PC2, and PC3 indicate score of principal components 1, 2, and 3, respectively. Each of PC1, PC2, and PC3 explained 12.0, 5.2, and 2.6% of variance in the data. Countries of collection of the soybean accessions and species names are represented by two-letter codes—CN, China; JP, Japan; NK, North Korea; RS, Russia; SK, South Korea; Gm, G. max; and Gs, G. soja. A putative hybrid PI 549046 is labeled. c Population structure of 50 G. max (Gm) and 144 G. soja (Gs-I, Gs-II, Gs-III, and Gs-IV) accessions inferred using ADMIXTURE. Each color represents one population. PI 549046 showed ~ 20% of ancestral genomic fractions from G. max. d Geographic distribution of the four G. soja subgroups. Gs-I is red, Gs-II green, Gs-III orange, and Gs-IV blue. e–g Neighbor-joining phylogenetic tree of 194 soybean accessions based on the SNPs genotyped by the 180 K AXIOM SoyaSNP array, with evolutionary distances measured by the p distance. The taxa used in the neighbor-joining tree and bootstrap values from 1000 bootstrap replications at branches are described in Fig. S9. e Phylogenetic tree based on 117,095 SNPs. f Phylogenetic tree based on 108,899 SNPs, which are weakly or not significantly associated with domestication traits. g Phylogenetic tree based on 8196, which are very significantly associated with domestication traits. PI 549046 from group Gs-I clusters between Gm and Gs-IV likely because of contribution of ancestral genomic fraction from Gm

Table 1 Mean per-site nucleotide diversity (π) of Glycine max (Gm) and each of G. soja groups (Gs-I to Gs-IV) in the representative soybean set

We constructed a neighbor-joining tree for the representative soybean set (Figs. 4e and S9). The tree showed that all G. max accessions formed a monophyletic cluster. Although G. max was artificially selected recently, terminal branch lengths were similar to those of G. soja likely because of ascertainment bias that more SNPs were selected from G. max than from G. soja (Lee et al. 2015). The population of the nearest branches, which were basal to the G. max soybean lineage, was G. soja subgroup Gs-IV. Within the Gs-IV that contains wild soybeans from Japan, those from eastern Japan area were closer to the G. max soybean lineage. To measure population differences and similarities, we calculated the fixation index values (FST) (Holsinger and Weir 2009) between G. max and each G. soja population (Table 2). The pairwise FST values ranged from 0.201 to 0.334. The value of FST for the G. max and Gs-IV populations was the smallest, suggesting that G. max was domesticated directly from G. soja subpopulation Gs-IV. The level of population differentiation among G. soja subpopulation was much lower than that between G. soja and G. max, similar to the case of rice (Huang et al. 2012). However, FST values between G. soja subpopulations corresponded with their geographic distances. The regions containing those wild populations that are phylogenetically close with cultivars could be proposed as the domestication region of crops (Matsuoka et al. 2002; Spooner et al. 2005). Thus, our results suggested that soybeans had been most likely domesticated only once in eastern Japan.

Table 2 FST values between Glycine max (Gm) and each of G. soja groups (Gs-I to Gs-IV) and between G. soja groups

Because the number of G. soja accessions from China in our representative set was smaller than those from Korea and Japan, the population structure revealed by our representative set was further resolved by incorporating the SNP data from 62 G. soja accession genomes resequenced by Zhou et al. (2015) By intersecting these SNPs with the set of 117,095 high-quality SNPs selected in this study, we extracted 103,801 SNPs, which were shared between the genome resequencing data and our representative set. Of the 62 accessions, only 45 were incorporated into the representative set because of the high level (eleven accessions, > 20%) of heterozygous SNPs, hybrid (one), and overlapping (five) (Fig. S10 and Table S6). The resultant expanded set contained twelve diverse accessions from Zhejiang, China, and one accession from Taiwan, thus increasing geographic coverage of this study further down to southern China. In results, the diversity level of G. soja accessions from China and its Russian border was similar to that from Korea or Japan (Table 1). Population structure of the expanded set inferred from ADMIXTURE and PCA was quite similar to that of our representative set, although the G. max accessions appeared to be divided into two groups likely because of the high level of heterozygous SNPs from the genome resequencing data (Fig. S11, S12, and Table 2). Phylogenetic analysis and estimated FST values between subpopulations indicated that G. soja accessions collected from eastern Japan were closest to the G. max soybean lineage.

To evaluate the distribution of SNPs associated with domestication syndrome across soybean subpopulations, we divided the 117,095 SNPs into 8196 SNPs highly significantly associated with domestication traits and 108,899 SNPs weakly or not significantly associated with domestication traits. The 8196 SNPs (-log10P > 17) were selected because the previous studies have shown that ~ 7% of the crop genome is domestication related (Chung et al. 2014; Hufford et al. 2012; Xu et al. 2012). Our GWAS also indicated that, in our GWAS population, strong LD extent under the highly significant SNPs in a peak tended to be much shorter than chromosomal extent under all the significant SNPs of the peak. The tree constructed from the 108,899 SNPs (Fig. 4f) was similar to that constructed from the 117,095 SNPs in their overall grouping and branch patterns, except that the branch length and grouping of the G. max clade were slightly different to each other. Grouping patterns in the tree constructed from the 8196 SNPs (Fig. 4g) were similar to those in the trees constructed from both 108,899 SNPs and 117,095 SNPs, except that the putative hybrid PI 549046 moved the closest to the G. max clade. Interestingly, the lengths of the basal and terminal branches for the G. max clade and the G. soja Gs-IV clade in the tree from the 8196 SNPs became distinctively longer than those in the other G. soja clades. The results indicated that initial major artificial selection for soybean domestication was limited to a G. soja group from Japan.


The present study analyzed genome-wide SNP variations obtained from thousands of soybean accessions, the majority of which were collected from the Korean peninsula. The results provide insight into the development of strategies for efficient and directed germplasm use as well as for collection of novel landraces and wild relatives. Population structure and grouping analyses revealed strong correlations between genetic distance and geographic distance in G. soja (wild soybean) populations and weak correlations in G. max (domesticated soybean) populations. G. soja accessions were divided into four distinct subgroups; Gs-I and Gs-II from China and its Russian border, Gs-III from Korea, and Gs-IV from Japan. The results suggest that although the Korean territory is much smaller than Chinese and Japanese territories, the sea-imposed geographic separation among these countries has been a major contributor to the evolutionary divergence of G. soja. Most of the Gs-III group from Korea appeared to be ancient admixtures between Gs-II and Gs-IV, suggesting that G. soja spread from each of China and Japan might be mixed in Korea. Thus, Korean wild soybeans are expected to be rich in terms of diversity than the other countries’ wild soybeans because they alone provide variations from two large subgroups. Interestingly, accessions from Jeju Island off the southern coast of the Korean Peninsula are the closest grouped to the Gs-IV group among members of the Gs-III group, indicating that although our estimated FST values between G. soja groups denied appearance of a new distinct group, more extensive sampling from diverse areas will likely reveal better correlations between geographic and genetic distances among G. soja subpopulations. The G. max population was divided into four subgroups. However, it was apparent that the subgrouping did not reflect geographic origins. Particularly, landraces from North Korea that would be considered true landraces based on their collection time appeared in every subgroup. The majority of South Korean landraces that had been collected recently were grouped together. The majority of improved accessions from the USA were clustered closely together, supporting a previous observation (Hyten et al. 2006). Taken together, our results suggest that while G. soja germplasm will require additional sampling from diverse indigenous areas to expand the germplasm base, G. max germplasm is saturated in terms of genetic diversity. Thus, extensive genotyping and phenotyping of extant G. max germplasm would be the next step to expand the germplasm base of G. max.

Our results provided strong support for a single origin of G. max from eastern region in Japan, although pointing to a specific region in Japan likely requires analysis of more extensive wild and landrace soybean accessions from Japan. Whether a crop species stems from a single domestication event or from multiple independent domestications has been consistent with whether the domesticated species are monophyletic or polyphyletic, respectively, in the phylogenetic trees constructed from both the domesticated and wild progenitor species. Although diversity of chloroplast DNA, which represents maternal lineage of soybean, revealed multiple lineages of domesticated soybeans, analyses of recent genome-wide soybean variation data (Chung et al. 2014; Guo et al. 2010; Lam et al. 2010; Zhou et al. 2015) consistently showed the monophyletic nature of G. max, as observed in this study. In other words, recent soybean phylogenetic studies collectively indicated a single origin of G. max. The best examples of monophyletic grouping are wheat and barley, which appear to have been domesticated once from their wild ancestors in the Fertile Crescent (Badr et al. 2000; Ozkan et al. 2002). The origin of barley was further supported by the genome sequences of five 6000-year-old barley grains (Mascher et al. 2016). In cases of rice and common bean that showed polyphyletic groupings, single or multiple regions of origin of these crop species are still contentious (Bitocchi et al. 2012; Huang et al. 2012; Molina et al. 2011). This controversy may have arisen because most modern wild accessions studied represent descendants of ancient feralization of admixed accessions that resulted from hybridization events between domesticated species and wild species populations after domestication (Wang et al. 2017), indicating that one of the previously thought independent origin regions might be a secondary origin region. The grouping of PI 459046 in this study is a good example that shows how hybrids could mislead inference of relationship between wild and domesticated crop species.

A recent comprehensive study of the archeological records for soybean from Japan, China, and Korea indicated that Japan could have been a source of a large-seeded landrace of domesticated soybean that spread to Korea and subsequently to China (Lee et al. 2011). The archeological records suggest that selection of large seed sizes occurred in Japan (Lee et al. 2011; Nakayama 2015) by 5500 calibrated years (cal) before present (BP) and in Korea (Lee et al. 2011) by 3500 cal BP. Seed size is clearly a domestication trait because the seed size of G. soja is much smaller than that of G. max landraces (Broich and Palmer 1980). However, the archeological data were interpreted to suggest the multiple origins hypothesis of soybean. One particular reason is that the excavated tiny seeds were as old as 9000–8600 cal BP in northern China and 7000 cal BP in Japan. However, the size of the seeds is similar to that of the seeds of present-day wild soybeans, and so would have been quite different from the landraces already grown in China by 2500 BP. Another reason is that the interpretation was greatly influenced by a previous report that diversity of chloroplast DNA SSRs in wild and domesticated soybeans showed evidence for multiple origins of domesticated soybean (Xu et al. 2002). However, as mentioned above, numerous recent genome-wide soybean genome variation studies consistently show a single origin of G. max.

One of the main reasons that the previous studies pointed different regions in China as the center of soybean domestication is likely sampling bias. Our results suggested that wild accessions from China had genetic diversity level almost equal to those from Korea or Japan. However, most previous studies tended to neglect this fact. In an extreme case (Han et al. 2016), no accession from Korea and Japan was used, with the conclusion that central China is the initial domestication region. Another confounding factor is the inclusion of hybrid soybeans from natural mating between G. soja and G. max. Hybrid soybeans were not recognized in many previous soybean population studies, although hybrids between wild and domesticated species have been increasingly regarded as a major problem in studies of crop domestication history (Bitocchi et al. 2012; Wang et al. 2017). Furthermore, it was often assumed that a region in China is a center of soybean domestication because hybrid soybeans are frequently found in China (Han et al. 2016). However, of the 50 hybrids that we removed to avoid their potential confounding effects in this analysis, the majority (36 of 50) were accessions from Korea. The domestication of domesticated plant species from their wild ancestors arose from rapid evolutionary changes in the past 13,000 years of Holocene human history (Diamond 2002; Larson et al. 2014). The list of origins and the list of the most productive areas of most of major crops in the modern world are almost mutually exclusive. This could be explained by that the domestication origin of a crop was merely a region to which the most numerous and most valuable domesticable wild plant species were native. In this respect, our result that shows Japan as the domestication origin of soybean is not totally unexpected one.

Expanding on previous studies that reported genome-wide polymorphism data of soybean germplasm (Bandillo et al. 2015; Chung et al. 2014; Lam et al. 2010; Valliyodan et al. 2016; Wang et al. 2016; Zhou et al. 2015), our results show that accessions intensively collected from Korea, which is a small area of the entire soybean distribution, provide sufficient amounts of data to underpin genome-wide population genetic questions that have been neglected or misled in the context of diversity and domestication panels of extant individuals. Our analysis demonstrates the value of current germplasm collections and how to expand the germplasm base. Furthermore, the findings show that a single major domestication event had occurred in a region of Japan. In addition, the high-density SNP array data enabled detection of domestication-associated SNPs and regions controlling important agronomic traits in a highly accurate manner. This suggests that our results will likely be useful for marker-assisted selection and genomic prediction to utilize unexplored genetic diversity in the soybean germplasm.

Author contribution statement

SCJ and JKM designed the study; SCJ, JKM, SKP, NJ, MSC, STK, and EP collected data; SCJ, JKM, MSK, KL, SRL, and NK performed the analyses; SCJ and JKM wrote the paper.