Background

The cultivated soybean [Glycine max (L.) Merr.] is an economically important crop that is grown all over the world. With an average of ~38% protein and ~18% oil content in seeds, soybean provides 69% of dietary protein and 30% of vegetable oil consumption worldwide (www.usda.gov). Modern soybean cultivars were domesticated more than 3000 years ago from their wild progenitor (Glycine soja Sieb. & Zucc.), a species endemic to China [1]. Since then, a variety of morphological and physiological changes, though not reproductive isolation, have arisen that distinguish soybean cultivars from their wild ancestor. Wild soybeans are much more adaptable to natural stresses such as drought and salinity, whereas cultivated soybeans exhibit a bush-type growth habit with large seeds, variable seed coat colors and a stout primary stem. Wild soybeans also differ from cultivated soybeans in photosynthetic capacity, pod dehiscence and pod number [2-4].

Heritable changes that occurred during plant domestication are being revealed by gene mapping and genomic analyses [5]. The availability of the soybean genome and of high-throughput sequencing technologies provides an excellent opportunity to investigate domestication events and phenotypic diversification at the genome level [6]. Re-sequencing of wild and cultivated accessions has revealed the nature and extent of genetic diversity in both populations [7-9]. Another study reported a reservoir of genes that were affected by early domestication and modern genetic improvement [10]. In addition, several domestication-related traits have been studied and proposed to be controlled by a small number of genes or several major QTLs [11,12]. However, more analyses are needed to delimit these QTL regions and the footprints of domestication for further gene mapping.

From an evolutionary perspective, a mutation that happens to be beneficial to the species will spread rapidly through the population under selection [13]. During crop domestication, strong selective pressure caused traits of interest to be fixed in a founder population within a short time [14]. Advantageous mutations underlying these traits are therefore likely to become fixed in the population. These fixation events differ from those in natural populations, because artificial selection usually acted on alleles that were likely neutral or nearly neutral before domestication. Thus, understanding nucleotide fixation driven by artificial selection is indispensable to complete the picture of soybean evolution. In this study, we collected published soybean sequencing data to identify single nucleotide variations (SNVs), from which we detected the genomic regions affected by artificial selection during domestication and improvement. Within these footprints, nucleotides fixed in all cultivars were potentially fixed by artificial selection. The genes carrying these nucleotides were analyzed further, and some of them were associated with agronomic traits through functional annotation and QTL meta-analysis. This kind of investigation provides clues to the differentiation of wild and cultivated soybeans, as well as practical information for the future enhancement of cultivars through traditional breeding and transgenic methods.

Results

Estimation of single nucleotide variations among soybean populations

Recently, a set of diverse soybean accessions was sequenced on NGS platforms [7,8,10]. These accessions, representing wild soybeans and cultivars (mainly landraces and modern elite accessions from East Asia), were selected on the basis of intensive molecular and phenotypic analysis to maximally reflect the genetic diversity of soybeans (Additional file 1: Table S1). They provide an important resource for depicting the genetic diversity of wild and cultivated populations and for detecting the footprints of domestication events. We therefore downloaded all the short reads of the sequenced soybeans from the NCBI Sequence Read Archive under accession numbers SRA020131, SRA009252, SRP015830, and ERP002622. These reads were aligned to the soybean reference genome Glycine max (Phytozome v9) with SOAP2 [15], and were subsequently used to detect SNVs with the SOAPsnp pipeline [16]. A total of 9,820,934 SNVs were identified across all accessions, of which 8,168,883 and 5,201,747 appear in wild soybeans and cultivars, respectively. Previous reports using the same pipeline have shown an SNV calling accuracy of 95-99%, with false-positive and false-negative rates of ~2% and ~3%, respectively [17-19].

To estimate how well these SNVs cover the whole soybean germplasm, we used a random sampling approach to track the accumulation of SNVs as accessions are added (Figure 1A). The end of the SNV curve tends to be flat, which indicates that the SNVs identified here probably approach saturation in soybean germplasm. As few as 48 accessions are sufficient to detect 95% of all SNVs across the populations. For cultivated soybeans, only 30 individuals are needed to reach 95% of SNVs. The number of SNVs in cultivars saturates at approximately 5.2 million, far fewer than in wild soybeans. In previous work [7], Lam et al. reported 6.3 million SNVs in 31 soybeans, whereas we discovered 2,481,645 more in the same individuals by analyzing a larger population. A large number of rare SNVs and SNVs with low allele frequency were missed in the former analysis because of strict filtering conditions and the small number of individuals (Figure 1B). Although some very rare SNVs remain to be discovered, we have identified a substantial majority of the common SNVs in soybeans.
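
The random sampling idea behind the accumulation curve can be illustrated with a minimal sketch. It assumes each accession is represented by the set of SNV identifiers it carries; the function names, the number of permutations and the toy data are hypothetical and are not taken from the original analysis.

```python
import random

def accumulation_curve(snv_sets, n_permutations=100, seed=0):
    """Mean number of distinct SNVs seen after adding 1..N accessions.

    snv_sets: list of sets, one set of SNV identifiers per accession.
    Accessions are added in a random order, repeated n_permutations times.
    """
    rng = random.Random(seed)
    n = len(snv_sets)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = rng.sample(range(n), n)  # random order of accessions
        seen = set()
        for k, idx in enumerate(order):
            seen |= snv_sets[idx]
            totals[k] += len(seen)
    return [t / n_permutations for t in totals]

def accessions_needed(curve, fraction=0.95):
    """Smallest number of accessions whose mean SNV count reaches the fraction."""
    target = fraction * curve[-1]
    for k, value in enumerate(curve, start=1):
        if value >= target:
            return k
    return len(curve)

# Toy example with four hypothetical accessions.
toy_sets = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5}]
toy_curve = accumulation_curve(toy_sets, n_permutations=50, seed=1)
```

A curve that flattens toward its final value, as described for Figure 1A, corresponds to later accessions contributing few new SNVs.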

Figure 1

Detection of single nucleotide variations in sequenced soybean accessions. (A) Accumulated SNV coverage in cultivars, wild soybeans and all accessions; (B) Distribution of the SNVs missed in the previous report by Lam et al.

Soybean has experienced several genetic bottlenecks, including early domestication, which produced many Asian landraces; the introduction of a few landraces to North America; and modern extensive breeding [20]. Genetic diversity was reduced to different degrees during these human-mediated events. More SNVs were identified in wild than in cultivated accessions. Two statistics commonly used to measure nucleotide diversity are the pairwise divergence per nucleotide θπ [21] and the Watterson estimator θw [22], which corrects for sample size. Whole-genome analysis using these parameters shows a higher level of genetic diversity in wild populations (Figure 2A). Estimated by θπ, the average diversity within wild, landrace and elite accessions is 3.84 × 10⁻³, 2.40 × 10⁻³, and 2.08 × 10⁻³ per nucleotide, respectively. Because the cultivars comprise landraces and elites, the average θπ of the cultivated population is 2.25 × 10⁻³. Notably, the cultivars have retained only 58.6% of the sequence diversity present in wild soybeans, which is lower than previous estimates [7,20]. Genetic diversity was reduced by 37.5% in early domestication and by a further 8.3% in genetic improvement.
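
The percentages above can be reproduced from the reported θπ values. The short calculation below is only an arithmetic check; it assumes that both reductions are expressed relative to the wild level, which is the reading under which the reported figures agree.

```python
# Reported average pairwise diversity (theta_pi) per nucleotide.
theta_pi = {
    "wild": 3.84e-3,
    "landrace": 2.40e-3,
    "elite": 2.08e-3,
    "cultivated": 2.25e-3,  # landraces and elites combined
}

wild = theta_pi["wild"]

# Share of wild diversity retained in cultivars.
retained = theta_pi["cultivated"] / wild

# Reductions, both assumed to be relative to the wild level.
loss_domestication = (wild - theta_pi["landrace"]) / wild
loss_improvement = (theta_pi["landrace"] - theta_pi["elite"]) / wild

print(f"retained in cultivars: {retained:.1%}")
print(f"lost in domestication: {loss_domestication:.1%}")
print(f"lost in improvement:   {loss_improvement:.1%}")
```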

Figure 2

Analysis of genetic diversity and phylogenetic relationship among soybean accessions. (A) Reduction of genetic diversity from wild, to landrace and then to elite soybeans; (B) A neighbor-joining tree; (C) Principal component analysis of soybeans.

The erosion of genetic diversity by artificial selection is also reflected in the phylogenetic tree (Figure 2B) and the principal component analysis (PCA, Figure 2C). The wild soybeans were scattered loosely in three-dimensional space, while cultivated soybeans formed a relatively tight cluster distinct from the wild individuals. Within the cluster, however, the landraces were not clearly separated from elite cultivars. Some landraces grouped with the wild soybeans in our analysis, indicating that early domestication was probably accompanied by considerable gene flow with the wild ancestors. In addition to artificial selection, the genetic erosion may also reflect the narrow genetic base of cultivated soybeans [23]. Analysis of representative wild and cultivated soybeans provides a comprehensive view of the evolutionary events that affected the population dynamics of soybeans.

Detecting artificial selection and nucleotide fixation in soybeans

Artificial selection can be detected through the loss of genetic diversity, which produces selective sweeps around beneficial alleles in the genome [24-26]. To further elucidate the effects of domestication, we detected the genomic regions showing signals of artificial selection with a genetic bottleneck model [18,19] and the population branch statistic [27]. The sequenced accessions, except C12 and C16, were grouped into wild and cultivated populations to detect selection signals from early domestication. Using a sliding window approach, we calculated the distribution of θπ and Tajima’s D [28] in the wild and cultivated populations along the genome. Regions with significantly lower θπ (Z test, P < 0.05) and lower Tajima’s D (Z test, P < 0.05) in cultivars than in wild accessions were treated as putative candidates affected by early domestication (Figure 3A). However, signals of very recent selection can easily be missed by this bottleneck model. To detect signatures shaped by modern crop improvement, we therefore used the population branch statistic. Taking wild soybeans as the control, we recalculated the divergence index Fst [29] in sliding windows along the genome and used significant signals (P < 0.001 after Bonferroni correction) to infer selective footprints from landraces to elite cultivars (Figure 3B). This approach has been shown to be effective in identifying recent artificial selection given the very short history of modern breeding [18]. In total, 598 regions comprising 27.9 Mb of genome sequence were affected by early domestication, and 286 regions spanning 12.7 Mb were affected by genetic improvement. Based on the latest annotation, 2,255 genes with 3,100 transcripts were involved in early domestication, whereas 1,051 genes with 1,462 transcripts were affected in subsequent improvement.
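
As an illustration of the population branch statistic, the sketch below uses its usual published form: each pairwise Fst is transformed to a branch length T = -log(1 - Fst), and the branch leading to the focal population is half of the sum of its two branch lengths minus the branch length between the other two populations. Treating elite cultivars as the focal population, landraces as the sister population and wild soybeans as the outgroup is an assumption based on the description above, and the Fst values in the example are invented.

```python
import math

def branch_length(fst):
    """Transform Fst into a divergence time-like branch length, T = -log(1 - Fst)."""
    # Clamp to [0, 1) so that negative estimates and Fst = 1 do not break the log.
    fst = min(max(fst, 0.0), 0.999999)
    return -math.log(1.0 - fst)

def population_branch_statistic(fst_focal_sister, fst_focal_out, fst_sister_out):
    """Length of the branch leading to the focal population in a three-population tree."""
    t_fs = branch_length(fst_focal_sister)
    t_fo = branch_length(fst_focal_out)
    t_so = branch_length(fst_sister_out)
    return (t_fs + t_fo - t_so) / 2.0

# Hypothetical window: elite vs landrace, elite vs wild, landrace vs wild.
pbs_selected = population_branch_statistic(0.30, 0.50, 0.20)
pbs_neutral = population_branch_statistic(0.05, 0.20, 0.20)
```

A window in which the elite population has diverged from both landraces and wild soybeans yields a long elite branch, whereas a window with little elite-specific change yields a value near zero.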

Figure 3

Footprints of artificial selection during (A) early domestication and (B) modern improvement.

During human-mediated breeding, strongly selected advantageous mutations can become fixed as they increase in frequency in a population [11,13]. A selective sweep forms when a selected mutation goes to fixation, because it reduces variability in the neighboring region where neutral variants are segregating [30,31]. We defined a nucleotide fixation locus as an SNV that has a single genotype in one population while remaining polymorphic in the other. To better understand how genes were affected by domestication events, we focused primarily on genes with nucleotide fixation in the selective footprints. We calculated the genotype likelihoods of each individual and assigned the allele type with the maximum likelihood back to that individual as its consensus genotype. After calibration, 101,292 nucleotide fixations were identified in the selective regions in cultivars, which could potentially have been caused by artificial selection.

Compared with the genome-wide distribution, nucleotide fixations occurred more frequently in the candidate regions of artificial selection (Figure 4). Nucleotide fixations accumulated substantially in cultivars and were unevenly distributed along chromosomes (Additional file 2: Figure S1), indicating that some chromosomes were more strongly affected by artificial selection. Nucleotide fixation also helps explain the reduction of genetic diversity in cultivated crops compared with their wild ancestors. We analyzed the allele frequency in wild soybeans of the SNVs that were fixed in cultivars, as it represents the initial state of these sites before domestication. The frequency spectrum suggests that these SNVs were almost neutral at the beginning of domestication (Additional file 3: Figure S2). Since non-synonymous substitutions may change protein function, they are subject to natural or artificial selection [32]. Of the nucleotide fixations that occurred in early domestication, 24,316 were located in coding sequences, and 2,162 of them caused non-synonymous substitutions in 1,188 genes, altering the amino acid sequences of the encoded proteins. Of the loci fixed in modern improvement, 8,065 were located in coding sequences, of which 756 were non-synonymous substitutions in 489 genes. Thus, more nucleotide fixations were introduced into cultivars during domestication than during improvement.

Figure 4

The distribution of nucleotide fixation over the genome versus in the selective regions. The window size was set to be 20 kb.

A central question in analyzing the genetic variation in a population is whether the population has substructure [29,33]. When the nucleotide fixations were analyzed by PCA and a phylogenetic tree, cultivars and wild soybeans formed two distinct clusters (Additional file 4: Figure S3). Some noise always exists in inferring phylogenetic relationships among individuals, especially when they are subject to introgressive hybridization [34,35]. Here the cultivars clustered tightly without such noise, supporting the hypothesis of a single rather than multiple evolutionary origins of soybean domestication [36,37].

Nucleotide fixation in wild soybeans

In the process of nucleotide substitution, a mutation can spread through the population to fixation by random genetic drift or by strong natural selection [38]. In the regions affected by artificial selection, 4,111 nucleotide fixations occurred in wild soybeans; these were located in 875 transcripts corresponding to 723 genes. Nucleotide fixation occurred more frequently in cultivars than in wild soybeans. To some degree, artificial selection could have promoted fixation events. However, genetic bottlenecks caused by domestication often result in a smaller effective population size in cultivars than in wild soybeans, which would also contribute to an elevated level of nucleotide fixation. Genes affected by nucleotide fixations were involved in various biological activities described in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Additional file 5: Figure S4).

Wild soybeans have much broader resistance to pathogens than cultivated soybeans [23,39]. Interestingly, the gene Glyma20g08290 is an ortholog of the disease resistance gene RPM1, which was identified and characterized by molecular genetic approaches in Arabidopsis [40]. In soybeans, the RPM1 gene was recently reported to be under purifying selection [41]. It serves as an example that natural selection in the wild population also caused nucleotide fixations, although its strength was less than that of artificial selection.

Agronomic traits affected by selective nucleotide fixation

During domestication, artificial selection is thought to have exerted extremely strong selective pressure on the ancestral population for desired phenotypes [42]. This strong human-imposed selection led to an excess of nucleotide fixations during domestication. Artificial selection during soybean domestication has modified a number of traits, including seed size, seed color, plant height and prostrate habit, shaping the domestication syndrome [11,43]. To analyze the effects of nucleotide fixation under artificial selection, we focused on genes within QTLs for domestication-related traits (www.soybase.org), such as oil content, pod number, lodging and plant height. Meta-analysis of these QTLs revealed that 51 QTLs for 13 traits were affected by nucleotide fixation in early domestication and 33 QTLs for 11 traits in modern improvement (Additional file 1: Table S2, S3). With the aid of the selective signals, the total QTL region was narrowed from 214.9 Mb to 8.1 Mb. Analysis of the related genes, as well as of their orthologs through comparative genomics, could provide information on their potential functions under artificial selection.

Grain filling is an agriculturally important trait that contributes substantially to grain weight [44]. The gene Grain Incomplete Filling 1 (GIF1) was found to be responsible for this trait and associated with the domestication syndrome [45]. It was reported to encode a cell-wall invertase required for carbon partitioning during early grain filling in rice. Glyma03g35520, a selective gene with nucleotide fixation in domestication, is an ortholog of GIF1 and is assigned to the carbohydrate metabolism pathway in KEGG (Additional file 1: Table S4). This gene is also covered by QTLs for lodging and pod number. These observations suggest that Glyma03g35520 is a potential candidate gene for further soybean breeding.

Flower and pod numbers per plant are important agronomic traits for grain yield in soybean. Detecting the genes involved in flower and pod numbers will help to understand the genetic basis of soybean yield [46]. Two genes with nucleotide fixation introduced during improvement, Glyma07g05470 and Glyma07g05480, are orthologs of the COMT2 gene, which encodes caffeic acid 3-O-methyltransferase (Additional file 1: Table S5). The gene is differentially expressed in hair cells of the growing pod, the possible location of vanillin biosynthesis [47]. Another five selective genes with nucleotide fixation from domestication and improvement encode proteins responsible for inositol transport. These genes are covered by QTLs for seed-coat color, protein and pod number. A previous study showed that the total number of mature pods was considerably higher after the application of inositol, indicating a positive effect on pod number [48]. It has also been suggested that a deficiency in lignin biosynthesis results in growth reduction and dwarfing [49]. The gene Glyma13g21010 is linked to the marker Sat103, which is associated with seed weight. It is an ortholog of the NifU gene, which is required for full activation of nitrogenase catalytic components [50]. The NifU protein has been suggested either to mobilize the Fe necessary for nitrogenase Fe-S cluster formation or to provide an intermediate Fe-S cluster assembly site [51]. In addition, the gene was reported to be related to seed weight [52]. As nitrogen fixation is essential for soybean growth, Glyma13g21010 could also be a putative candidate gene for seed weight through activating biological nitrogen fixation.

Flowering represents the transition from the vegetative to the reproductive state and thus contributes to yield. Meta-analysis of QTLs identified 14 selective genes with non-synonymous nucleotide fixation associated with flower number in soybean. Carbon fixation through photosynthesis is pivotal to soybean production, and seven selective genes with nucleotide fixation were involved in photosynthesis or photosystems. In addition, two selective genes with nucleotide fixation, Glyma03g36970 and Glyma19g39620, were identified as orthologs of Luminidependens, which is involved in the timing of flowering in Arabidopsis [53].

Interestingly, 63 and 27 selective genes with nucleotide fixation in domestication and improvement, respectively, were annotated as, or related to, transcription factors. Analysis of all the genes subject to artificial selection with agriGO [54] also indicated an enrichment of transcription factors by Fisher’s exact test and the permutation test (Additional file 1: Table S6). Most of the genes cloned to date that are responsible for domestication-related traits in crops have proved to be transcription factors, such as teosinte branched 1 (tb1) [55], shattering (sh4) [56] and six-rowed spike (vrs1) [57,58]. This is probably because human-mediated domestication was brief compared with the long course of natural evolution, so that changes in transcription factors were probably the easiest way to alter the agricultural traits of interest. However, the putative candidate genes underlying these domestication-targeted phenotypes have diverse functions, which need to be validated by further experiments.

Plant-pathogen interaction affected by artificial selection

Domestication caused complex morphological and physiological changes in soybeans. According to annotation with the KEGG and agriGO databases, selective genes were associated with different biological functions, among which plant-pathogen interaction, sequence-specific DNA binding, phenylpropanoid biosynthesis, and starch and sucrose metabolism are over-represented categories (Figure 5; Additional file 6: Figure S5). Plant-pathogen interactions take place between a pathogen and its host plant. In nature, plants are generally resistant to most invading pathogens because of their innate ability to recognize them and mount successful defenses. When this fails, a pathogen can cause disease in its host [59]. Pathogens can also cause disease if they have evolved to evade detection, to suppress host defense mechanisms, or both. The effects of plant-pathogen interactions on agricultural systems are of particular relevance during early domestication events [60]. Thus, the genetic basis of why a certain pathogen causes disease in its host plant but not in others has long intrigued and motivated plant pathologists.

Figure 5

Functional annotation of selective genes with nucleotide fixation introduced in early domestication and modern improvement.

A total of 37 selective genes with nucleotide fixation were involved in plant-pathogen interactions (Additional file 7: Figure S6). Among them, two selective genes with nucleotide fixation, Glyma14g36511 and Glyma08g12560, are orthologs of the RPS2 gene. The disease resistance gene RPS2 was isolated by positional cloning and by screening for susceptible mutants [61,62]. The RPS2 protein contains two features characteristic of a large family of plant R genes: a nucleotide-binding site and a leucine-rich repeat region [63]. Our finding is consistent with a previous report, based on a worldwide sample of 27 Arabidopsis accessions, that the RPS2 locus exhibits selection signals and that the N-terminal part of the leucine-rich repeat region was a probable target of selection [64].

Cyclic nucleotide-gated ion (CNG) channels function in the pathogen signaling cascade by facilitating Ca²⁺ uptake into the cytosol [65]. Two selective genes with nucleotide fixation were found to encode CNG channels. The topology of their proteins was predicted using TMHMM, which is based on a hidden Markov model [66]. Both genes encode transmembrane proteins in which the fixed nucleotides map to regions outside the membrane (Additional file 8: Figure S7). In addition, eight selective genes are orthologs of the transmembrane receptor kinase FLS2, which perceives pathogen-associated molecular pattern signals to trigger the innate immune response [67].

In addition, the category of terpene synthase activity was also enriched, with six selective genes involved (Additional file 1: Table S6). Terpenes are among the most important plant defensive compounds against herbivores and pathogens [68]. Recently, a new monoterpene synthase gene, GmNES, was identified and characterized in soybean [69]. Its transcription was up-regulated in soybean infested with cotton leafworm. Our analysis indicates that the gene was possibly affected by artificial selection during soybean domestication.

Discussion

Nucleotide fixation was crucial in soybean divergence

Domestication led to significant morphological divergence between cultivated and wild soybeans. Wild soybean exhibits, for example, twining, vine-like stems, more severe shattering, impermeable seed coats, sensitivity to pod cracking, small seeds, and low oleic acid content, all of which are seldom observed in cultivars [70]. Deciphering how cultivated soybean has been transformed from its wild ancestor is valuable from both genetic and evolutionary perspectives. With the available sequencing data, we estimated the saturation number of SNVs in soybean germplasm and detected a set of candidate genes showing signals of artificial selection. To some degree, the analysis of artificial selection and nucleotide fixation sheds light on soybean domestication and subsequent improvement. Based on nucleotide fixation, our analysis supports a single evolutionary origin of domesticated soybean. During domestication, only lines with certain agriculturally important traits were selected, resulting in a genome-wide reduction of genetic diversity, or so-called selective sweeps, in cultivated crops [42,71,72]. One possible explanation for the reduction is that an excess of nucleotide fixation occurred in cultivars compared with wild soybeans.

Meta-analysis of QTLs for domestication-related traits and of the selective genes provided insights into the role that nucleotide fixation played in the morphological differentiation between wild and cultivated soybeans. Using comparative genomics, a number of genes were found to be orthologs of genes whose functions have been validated and shown to be responsible for the corresponding traits in other plants. Nucleotide fixation occurred in genes responsible for agronomically important traits. Although traditional linkage and association mapping have been used to dissect these traits, they failed to detect genetic changes caused by domestication and improvement [73]. Our analysis provides valuable information for further QTL mapping and will facilitate molecular marker-assisted selection in soybean breeding practice.

Artificial selection accelerates nucleotide fixation

Domestication was an evolutionary process in which characters of interest were selected, such as loss of seed dispersal, higher yield and increased resistance to abiotic stress. The detection of selective loci during crop domestication contributes to modern breeding efforts and offers the opportunity to improve genomic selection models [74]. Recently, genome-wide scans based on genetic bottlenecks have been successfully applied to detect footprints of selection in plants by surveying both natural and cultivated species [19,75,76]. Artificial selection of a beneficial mutation raises its frequency in a population. Eventually, allele frequencies became skewed and nucleotide fixation occurred after plant domestication. Our analysis focused on the degree to which nucleotide fixation was caused by artificial selection during soybean domestication.

More nucleotide fixations occurred in cultivars than in wild soybeans, suggesting that artificial selection was much stronger than natural selection. However, the effective population size of cultivated soybeans was substantially reduced during domestication [77], which could by itself make a nucleotide appear fixed in cultivars. This largely explains why nucleotide fixations were observed in cultivars across the whole soybean genome. Given that nucleotide fixations accumulated in the footprints of domestication and improvement, artificial selection probably accelerated fixation during soybean breeding. Even so, some of these fixations could also have been caused by the shrinking population size, especially where different haplotypes formed in the selective sweeps. Such fixations are extremely hard to distinguish in the current samples.

Morphological transition can be achieved by a mutation at a single locus [78,79], and artificial selection can rapidly change domestication-targeted phenotypes within 20 generations [31,80]. Domestication could therefore be a rapid rather than a slow or gradual process, given strong selective pressures and a suitable genetic architecture. This is supported by the severe reduction of genetic diversity and the large selective sweeps. During domestication, mutations detrimental to the traits of interest were quickly eliminated, whereas advantageous ones were strongly selected, spread and eventually became fixed in the population. The environments in which wild soybeans grow are varied and usually harsh, resulting in diversifying selection rather than strong directional selection. Moreover, the intensity of natural selection differed among habitats. These reasons also explain why artificial selection was much stronger than natural selection in crop domestication.

Evolutionary perspective of nucleotide fixation

A long-term goal of crop genomics is to determine to what extent artificial selection shapes patterns of genomic variation within and between populations. Both genetic and statistical approaches are available to detect the signals of selective sweeps caused by hitchhiking [13]. The hitchhiking effect depends on the nature of the genetic variation and on how selection acts on it. Generally, there are at least three evolutionary routes by which a novel mutation may become fixed: drift to fixation for a nearly neutral mutation; a rapid sweep to fixation, the so-called hard sweep, for a beneficial mutation; and a soft sweep to fixation for a mutation that was initially neutral but later became beneficial. Under artificial selection, a pre-existing mutation that became beneficial during domestication rapidly increased in frequency toward fixation, as we found in our analysis. When traits of interest during domestication were determined by multiple adaptive mutations at the same locus, artificial selection usually generated soft rather than hard selective sweeps. Many studies focus on hard sweeps, in which only a single adaptive haplotype rises to fixation in the population [81], whereas in a soft sweep multiple adaptive haplotypes rise simultaneously. Many nucleotide fixations occurred within QTLs for quantitative traits, indicating that the corresponding traits were changed incrementally at various causal loci. As a consequence, the sweeps associated with artificial selection are likely to be both soft and incomplete. In soybean, some yield-related traits were selected, such as seed weight, seed blooming and prostrate habit, for which major QTLs are usually responsible. During intensive breeding, by contrast, humans have pursued quality-related traits such as protein content and lipid content, which are governed by many small-effect QTLs. Analysis of nucleotide fixation indicates that more soft selective sweeps occurred in extensive breeding than in early domestication in soybean, although this still needs further investigation.

Conclusion

We integrated the available sequenced accessions to draw a comprehensive picture of soybean genetic diversity, artificial selection and concomitant nucleotide fixation. There are approximately 9.8 million SNVs in soybean germplasm, of which about 5.2 million are retained in cultivars. Genetic diversity was reduced by 37.5% in early domestication and by a further 8.3% in genetic improvement. A total of 2,255 and 1,051 genes were involved in early domestication and subsequent improvement, respectively. Together, the two processes introduced about 0.1 million nucleotide fixations, which contributed to the divergence of wild and cultivated soybeans. Artificial selection probably accelerated nucleotide fixation, which affected some agronomic traits as well as related biological pathways such as plant-pathogen interaction.

Methods

Data collection and SNP detection

The sequenced soybean accessions, comprising 31 wild, 15 landrace and 24 elite accessions, were described in several published papers [7-10]. These accessions originate from a wide range of ecological regions in China and South Korea. All sequence reads were downloaded from the Sequence Read Archive (SRA) under accession numbers SRP015830, SRA020131, SRA009252, and ERP002622. These reads were then mapped to the soybean reference (Glycine max var. Williams 82, Phytozome v9.0) with SOAP2 software [15]. PCR duplicates in each sequencing library were removed before SNV calling.

In the SNV calling process, the genotype likelihood at each genomic locus was first calculated with the Bayesian method implemented in SOAPsnp [16]. The genotype with the highest probability at each site was selected, together with a quality value, to create a consensus sequence for each individual. High-quality SNVs were obtained by filtering on criteria such as sequencing depth, copy number (<=1.5), quality value (>20) and the rank sum test.
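
The principle of choosing the most probable genotype and attaching a quality value can be illustrated with a deliberately simplified sketch. It is not the SOAPsnp model: it assumes a biallelic site, a single uniform sequencing error rate and fixed genotype priors, all of which are hypothetical choices made here for illustration.

```python
import math

GENOTYPES = ("ref/ref", "ref/alt", "alt/alt")

def genotype_posteriors(ref_reads, alt_reads, error=0.01,
                        priors=(0.998, 0.001, 0.001)):
    """Posterior probabilities of the three diploid genotypes at a biallelic site.

    Simplified binomial model (an assumption, not the SOAPsnp model): each read
    shows the alternate allele with probability `error`, 0.5 or 1 - `error`
    for ref/ref, ref/alt and alt/alt, respectively.
    """
    p_alt = (error, 0.5, 1.0 - error)
    logs = [
        math.log(prior) + alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)
        for prior, p in zip(priors, p_alt)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

def call_genotype(ref_reads, alt_reads, error=0.01,
                  priors=(0.998, 0.001, 0.001)):
    """Return the most probable genotype and a Phred-scaled quality value."""
    post = genotype_posteriors(ref_reads, alt_reads, error, priors)
    best = max(range(3), key=lambda i: post[i])
    p_wrong = max(1.0 - post[best], 1e-10)  # cap the quality at 100
    return GENOTYPES[best], -10.0 * math.log10(p_wrong)
```

Under this toy model, a call would pass a quality filter such as the one above (>20) only when the chosen genotype has a posterior probability above 0.99.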

Detection of artificial selection signals

As described in a previous report [10], we used two outlier approaches to detect signals of artificial selection. Using a 20 kb sliding window with a 2 kb step size, we calculated θπ and Tajima’s D in the wild and cultivated groups. Regions showing a significantly low θπ.cultivated/θπ.wild ratio and low D values in cultivars (Z test, P < 0.05 for both) were treated as putative selection signals. In addition, we used the population branch statistic [27], which is based on Fst, to infer the selective footprints from landrace to elite cultivar, considering the very short divergence time between them.
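
The window-based statistics can be sketched as follows. The code assumes biallelic SNVs summarized as alternate-allele counts among n sampled chromosomes, uses the standard formulas for pairwise diversity and Tajima's D, and standardizes window values as z-scores; the function names and the handling of degenerate windows are hypothetical choices, not the original pipeline.

```python
import math
import statistics

def sliding_windows(chrom_length, window=20_000, step=2_000):
    """Yield (start, end) for full windows of the given size and step."""
    start = 0
    while start + window <= chrom_length:
        yield start, start + window
        start += step

def pairwise_diversity(alt_counts, n):
    """Sum over SNVs of the expected pairwise differences among n chromosomes."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in alt_counts)

def theta_pi_per_site(alt_counts, n, window_length):
    """Pairwise diversity per nucleotide for one window."""
    return pairwise_diversity(alt_counts, n) / window_length

def tajimas_d(alt_counts, n):
    """Tajima's D for one window from alternate-allele counts."""
    segregating = [k for k in alt_counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        return 0.0  # assumption: report 0 when nothing segregates
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return 0.0  # degenerate case for very small samples
    pi = pairwise_diversity(segregating, n)
    return (pi - s / a1) / math.sqrt(variance)

def z_scores(values):
    """Standardize window statistics against their genome-wide distribution."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]
```

Windows in which both the diversity ratio and D fall in the lower tail of their genome-wide distributions would then be the candidates described above. An excess of rare alleles gives a negative D, and an excess of intermediate-frequency alleles gives a positive D.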

Identification of nucleotide fixation

We screened the SNVs located in the regions showing signals of artificial selection. Short reads of each individual were re-aligned to the reference for individual genotyping at each SNV. The likelihood of individual genotypes was calculated, and the allele type with the maximum likelihood was then assigned back to each individual. An SNV with a single genotype across all wild soybeans or across all cultivars was identified as a nucleotide fixation locus.
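
The fixation criterion can be expressed as a small classification rule. The sketch below assumes that consensus genotypes are available as strings such as "AA" or "AG", that missing calls are encoded as None and ignored, and that a locus counts as fixed in one group only if the other group is polymorphic, as in the definition given in the Results; the function name and labels are hypothetical.

```python
def fixation_status(cultivar_genotypes, wild_genotypes):
    """Classify an SNV by where it shows a single genotype.

    Genotypes are strings such as "AA" or "AG"; None marks a missing call
    and is ignored (an assumption made for this sketch).
    """
    cultivated = {g for g in cultivar_genotypes if g is not None}
    wild = {g for g in wild_genotypes if g is not None}
    if len(cultivated) == 1 and len(wild) > 1:
        return "fixed_in_cultivars"
    if len(wild) == 1 and len(cultivated) > 1:
        return "fixed_in_wild"
    return "not_fixed"

# Hypothetical SNV: every cultivar is GG while wild accessions still vary.
example = fixation_status(["GG", "GG", None, "GG"], ["GG", "AA", "AG"])
```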

PCA and phylogenetic analysis

The pattern of population subdivision was inferred by principal component analysis (PCA) [82]. We constructed a phylogenetic tree with the neighbor-joining method in PHYLIP (version 3.68) [83]. Bootstrap values were generated from 1,000 replicates.

Enrichment of selective genes

The functions of selective genes were analyzed with KEGG (www.genome.jp/kegg/) and agriGO (http://bioinfo.cau.edu.cn/agriGO/), and the results were displayed using the Cytoscape plugin BiNGO [84]. Enrichment P values (significance threshold P < 0.05) were calculated using Fisher’s exact test and the permutation test. For multiple hypothesis testing, the Benjamini and Hochberg false discovery rate correction was used to control false positives.
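
For readers unfamiliar with these tests, the sketch below shows a one-sided Fisher's exact test for over-representation, written as a hypergeometric tail probability, followed by the Benjamini and Hochberg adjustment. It uses only the Python standard library and is an illustration of the statistics, not the agriGO or BiNGO implementation; the function names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided P value for over-representation (hypergeometric upper tail).

    N: genes in the background, K: background genes in the category,
    n: selective genes, k: selective genes in the category.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini and Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A category would be reported as enriched when its adjusted P value falls below the chosen threshold.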

Inferring protein topology

We predicted transmembrane protein topology with TMHMM, a method based on a hidden Markov model, using default parameters [66] (http://www.cbs.dtu.dk/services/TMHMM/).