Introduction

Soybean [Glycine max (L.) Merr.] represented 56% of world oilseed production in 2007 (http://www.soystats.com). Soybeans provide 71% of the edible oils, and have unique benefits in agricultural ecology (e.g. biological nitrogen fixation), documented health benefits (e.g. the anticancer benefits of isoflavones and lunasin) and industrial utilization (e.g. biodiesel). However, soybean production is being challenged due to increasing demands for soybeans in food, feed and value-added products. Soybean breeders around the world are working to improve varieties with better nutritional quality, biotic and abiotic stress tolerance and higher yields. Molecular breeding technologies are increasingly being applied to develop genetic linkage maps and to identify genomic regions influencing traits related to soybean production and seed value, e.g., soybean cyst nematode resistance, high oleic acid, and low linolenic acid.

During the last two decades, both genomic mapping and sequencing methods have advanced significantly, providing tools for scientists to explore genome structure and function in many organisms. Generally speaking, genome mapping relies on genome sequencing to provide location-unique molecular markers to construct a blueprint for understanding genome structure and function. Genetic linkage maps are an essential prerequisite for studying the inheritance of both qualitative and quantitative traits, for molecular breeding and map-based gene cloning, and for genome structure and function studies. Molecular breeding is more effective if the genetic map is densely populated with markers, which increases the probability of successful trait introgression by transferring defined chromosomal fragments containing the target gene(s) while eliminating the linkage drag associated with unfavorable traits. Soybean genome mapping based on DNA markers began in the early 1990s, and numerous genetic linkage maps of soybean have been published in the last decade. The early maps were based primarily on restriction fragment length polymorphism (RFLP) markers, with more recent maps also including amplified fragment length polymorphism (AFLP) and simple sequence repeat (SSR) markers and, very recently, single nucleotide polymorphism (SNP) markers. In total, several thousand genetic markers (mostly SSR and SNP markers) have been mapped in the past 10 years (Cregan et al. 1999; Wu et al. 2001; Song et al. 2004; Kassem et al. 2006; Choi et al. 2007; Xia et al. 2007; Hisano et al. 2008; Yang et al. 2008). With those markers, more than one thousand quantitative trait loci (QTL) associated with important soybean traits have been identified using different mapping populations (http://www.soybase.org).

The soybean cultivars “Williams 82” and “Forrest”, representing Northern and Southern germplasm in the United States, respectively, have been used as models for soybean genomic research in the same way as the ecotypes Col and Ler in Arabidopsis (Arabidopsis thaliana) (Lister and Dean 1993) or cultivars Mo17 and B73 in maize (Zea mays) (Sharopova et al. 2002). The soybean community in the USA has built the majority of its genomic tools on these two cultivars, Forrest and Williams 82, which were recently reviewed in the literature (Jackson et al. 2006; Lightfoot 2008). The Williams 82 genome was sequenced by the Department of Energy, Joint Genome Institute (Schmutz et al. 2010), and the 8× scaffold assembly was released to the public (http://www.phytozome.net/soybean). Williams 82 and Forrest not only provide numerous ‘-omics’ data sets, but are also important ancestors of some modern cultivars because they carry useful genes (Lightfoot 2008). However, no genetic map developed from a Forrest × Williams 82 population has been available to date.

In this study, we report the construction of a framework genetic map using a core set of RILs selected from a large Forrest × Williams 82 mapping population, to serve as a reference mapping population in the soybean genomics community. This genetic map also offers opportunities to link the existing genetic maps to the “Williams 82” and “Forrest” physical maps (soybase.org; Wu et al. 2004). The markers mapped in the core set of the population should help integrate the two physical maps with the draft sequence as a framework for genomic research in soybean.

Materials and methods

Plant materials

The mapping population used in this study consisted of 1,025 F2:7 recombinant inbred lines (RILs) derived from the F1 seed from a cross between the cultivars Forrest (female parent) and Williams 82 (male parent). Each RIL tracing to a single F2 plant was advanced to the F7 generation by the single seed descent (SSD) method. In the F8 generation, 1 m rows of each F2:7 RIL were individually bulked. Young leaves from each F2 plant and ~4–8 plants of each corresponding RIL were collected, freeze-dried and ground for DNA isolation. To evaluate the genetic structure of the population before advancing to later generations, a subset of 760 plants in the F2 generation was genotyped with 295 SSR markers covering the whole genome.

DNA isolation

The genomic DNA was extracted with an AutoGenprep 965/960 machine (AutoGen, Holliston, MA, USA) using the AGP965/960 Plant DNA Extraction Kit, following the manufacturer’s instructions. The DNA was quantified by a NanoDrop spectrophotometer (NanoDrop Technologies Inc., Centreville, DE, USA) and normalized to 25 ng/μl as PCR working template.

Marker development

New SSR markers were identified from 25,640 Forrest sequences downloaded from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). Those sequences were analyzed with the microsatellite identification tool MISA (Thiel et al. 2003) to identify SSRs with at least six repeat units for di-nucleotide motifs and at least five repeat units for tri-nucleotide and longer motifs. Primers flanking the SSRs were designed using BatchPrimer 3 (You et al. 2008) so that the amplified PCR products ranged between 150 and 400 bp in length.
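The repeat thresholds above can be illustrated with a minimal sketch. It is not MISA itself and does not reproduce its configuration; it only applies the stated minimum repeat counts to perfect repeats with a regular expression. The function name `find_ssrs` and the 2–7 bp motif range (taken from the Results) are assumptions made for illustration.

```python
import re

def min_repeats(unit_len):
    # Thresholds from the text: six units for di-nucleotide motifs,
    # five units for tri-nucleotide and longer motifs.
    return 6 if unit_len == 2 else 5

def find_ssrs(seq, max_unit=7):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper for illustration; max_unit=7 is an assumption.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(2, max_unit + 1):
        n = min_repeats(unit_len)
        # ([ACGT]{k}) captures one motif; \1{n-1,} requires n-1 or more further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)

example = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "TTC"
print(find_ssrs(example))
```

Each hit gives the 0-based start position, the motif, and the number of repeat units; flanking sequence around a hit would then be passed to primer design.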

New Indel markers were identified by aligning the 25,640 Forrest DNA sequences against the soybean Williams 82 genome sequence (http://www.phytozome.net) using BLAST with an E-value cutoff of E-50. Where Indels with a size difference of more than 2 bp were identified, primers flanking these Indels were designed using the same software and the same parameter settings as for SSR primer design (You et al. 2008).
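As a hedged illustration of the size filter, the sketch below scans one gapped pairwise alignment (two equal-length strings with `-` for gaps) and reports gap runs above a minimum length. The function name `find_indels`, the input format, and the labels are assumptions; parsing of actual BLAST output is not shown.

```python
import re

def find_indels(aligned_query, aligned_subject, min_len=3):
    """Return sorted (alignment_column, length, kind) for gap runs >= min_len.

    min_len=3 corresponds to "more than 2 bp"; hypothetical helper for illustration.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")
    indels = []
    # A gap run in the query row means the subject carries extra bases, and vice versa.
    rows = ((aligned_query, "deletion_in_query"),
            (aligned_subject, "insertion_in_query"))
    for row, kind in rows:
        for m in re.finditer(r"-+", row):
            length = m.end() - m.start()
            if length >= min_len:
                indels.append((m.start(), length, kind))
    return sorted(indels)

forrest_bes = "ACGT---ACGTACGTAC"   # toy query row (Forrest)
williams_82 = "ACGTTTTACG-ACGTAC"   # toy subject row (Williams 82)
print(find_indels(forrest_bes, williams_82))
```

In this toy alignment the 3 bp gap is reported and the 1 bp gap is ignored.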

Genotyping

For SSR genotyping, an M13 primer sequence was added to the forward primer to allow detection with a common fluorescently labeled (FAM, PEP, NED or HEX) M13 primer as previously described (Schuelke 2000). The polymerase chain reactions (PCR) were carried out in a final volume of 15 μl on an Eppendorf thermocycler with a thermal profile consisting of a 3-min initial denaturation step at 95°C followed by 35 cycles of 30 s at 94°C, 30 s at 52°C and 45 s at 72°C, and a final 7-min extension step at 72°C. Reactions were conducted in 10 mM Tris–HCl (pH 8.4 at 25°C); 50 mM KCl; 0.1% (v/v); 2.5 mM MgCl2; 1% PVP-40; 200 μM dNTPs in the presence of 1 U Taq polymerase; 50 ng genomic DNA; 0.5 μM marker-specific reverse primer; 0.033 μM marker-specific M13-tailed forward primer and 0.5 μM FAM, PEP, NED or HEX-labeled M13 primer. PCR products were pooled in 8-plex into a Whatman 384-well DNA-binding plate for clean-up following the manufacturer’s instructions. The PCR products were eluted with 10 μl of formamide into a 384-well plate and run on an ABI 3100 or 3730xl along with a LIZ-labeled 500 bp size standard. Genotypes were called with GeneMapper (3.7), with an allele ratio of 0.5–1.5 scored as heterozygous.

For SNP genotyping, a universal soybean linkage panel (USLP 1.0) containing 1,536 SNP markers (Hyten et al. 2010) was employed to genotype the F2:7 RIL mapping population following an Illumina GoldenGate assay as described by Fan et al. (2006). In brief, after completion of a multi-step procedure including genomic DNA activation, PCR amplification, hybridization, and washing, array imaging was performed using the Illumina BeadStation (Illumina, San Diego, CA, USA) to generate intensity data. Allele calling for each SNP locus was subsequently conducted with the BeadStudio 3.0 software (Illumina, San Diego, CA, USA). The clusters of homozygous and heterozygous genotypes for each SNP were manually checked for polymorphisms between the two parental lines. The polymorphic SNP loci were then employed for genetic linkage map construction and QTL analysis as described earlier (Vuong et al. 2010).

Mapping population evaluation

A set of tests, including an equality test, a symmetry test, and a representativeness test, each using the chi-squared (χ2) statistic as the criterion, was used to test the probability that the parameters of the core set of RILs matched those of the theoretical RIL population, assuming that the distortion of the population was mainly due to shifting environments during the various seasons of generation advancement rather than to skewed selection of gametes and zygotes during meiosis (Gai et al. 2007). The theoretical values of a RIL population were estimated from 200 simulated populations with the software QGENE (Nelson 1997). The parameter settings for the simulated map and populations were set to match the observed values: population size = 376, chromosome number = 20, map length = 2,600 cM, proportion of missing genotypes = 2.5%, proportion of parent A (Forrest) allele = 45.7%, proportion of parent B (Williams 82) allele = 47.1%, and proportion of heterozygous allele = 4.7%. Two simulation scenarios were run: one with a fixed chromosome length (150 cM) and one with variable chromosome lengths drawn from a statistical distribution based on the parameter settings.

Linkage map construction

Linkage group (LG) marker order and map distance were calculated using the software JoinMap (version 4.0). The segregating markers were grouped into LGs on the basis of an LOD (logarithm of the odds ratio for linkage) score of ≥4.0 and by reference to previously reported LGs of the public SSR marker loci. The known order of SSR markers in the soybean genome was used as a fixed order when constructing each LG. Markers were tested for deviation from expected Mendelian segregation by the chi-squared test performed with the JoinMap software under the ‘Locus Genotypic Frequency’ command. Linkage between markers, recombination rate (Q), and map distances were calculated using the Kosambi mapping function.
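For reference, the Kosambi mapping function converts a recombination fraction r into a map distance d = 25 ln[(1 + 2r)/(1 − 2r)] in centimorgans. The sketch below implements this standard formula and its inverse; the function names are illustrative, and JoinMap's internal calculations are not reproduced here.

```python
import math

def kosambi_cm(r):
    """Map distance in cM for a recombination fraction r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))

def kosambi_r(d_cm):
    """Inverse: recombination fraction for a map distance in cM."""
    return 0.5 * math.tanh(d_cm / 50.0)

for r in (0.01, 0.10, 0.30):
    print(r, round(kosambi_cm(r), 2))
```

For small r the distance in cM is close to 100r; the function diverges from this as r approaches 0.5 because it allows for crossover interference.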

Results

BES-derived markers for map integration

About 25,640 Forrest sequence entries, including 25,463 BAC end sequences (BES) from the minimum tile path (MTP) clones of the Forrest physical map (Shultz et al. 2006) and 146 Forrest EST sequences posted in NCBI (http://www.ncbi.nlm.nih.gov/), were used to search for SSR motif repeats. The whole set of Forrest sequences was also used to identify Indel markers with more than two continuous nucleotide deletions or insertions after filtering out the repetitive elements using RepeatMasker (http://www.repeatmasker.org). The repeat-masked sequences were then aligned by BLAST against the Williams 82 genome sequence 8× scaffold assembly to map these BESs in silico onto the Williams 82 genome (http://www.phytozome.net/soybean). We extracted all aligned sequences with ≥2 bp difference of continuous nucleotides between Forrest BES and Williams 82 genome sequence. A total of 3,272 putative polymorphic markers, including 581 SSRs with 2–7 bp motif repeats and 2,691 Indels, were identified (Additional File 1). Of the 3,272 markers, 3,015 markers were anchored onto the Forrest physical map (Shultz et al. 2006). Based on the locations of the new polymorphic markers in the Williams 82 genome and the polymorphism test on public SSRs (Song et al. 2004), primer pairs were designed for 175 putative BES-derived markers that could fill the gaps of the genetic map (Additional File 2). Of the 175 tested markers, 127 markers were validated to be polymorphic by genotyping parental lines, showing 71.8% polymorphism. A total of 114 markers derived from Forrest BES or EST were mapped in this study. These 114 markers could serve as new anchors for integrating the Forrest physical map with the Williams 82 genome sequence assembly through the Forrest × Williams 82 genetic map. An example is shown in Fig. 1.

Fig. 1

Examples of Forrest BAC contig integration to Williams 82 genome assembly via genetic markers derived from BESs. a Forrest FPC contig1772 was anchored by the new genetic marker CG840127 to a 53 kb gap of the Williams 82 physical map integrated on the sequence assembly, b Forrest FPC contig923 was anchored by marker Satt576 onto the LG O (chromosome 10) covering a QTL for Sclerotinia stem rot resistance. This contig anchoring on the LG was confirmed by the new SSR marker MUS0250, but the BES sequence that was used to develop the MUS0250 marker was aligned to scaffold_844, which indicated that scaffold_844 might belong to chromosome 10

Selection of a core set population

A total of 1,025 F2 families were advanced by SSD to the F7 generation. The core set comprising 376 RILs was selected according to a framework map with 295 mapped SSR markers, which were evenly distributed across linkage groups in the composite genetic map (Choi et al. 2007). The procedure for selecting the core set of RILs was as follows: all markers were sorted in the same order as in the genetic map, and the number of crossovers and valid markers in each line was calculated to obtain the exchange frequency rate of the genome. The lines were sorted and selected in order based on the number of crossovers, the number of valid markers and the exchange frequency rate. Lines that provided unique crossover breakpoints were selected to be part of the core set. The core set includes all of the crossover breakpoints detected by the 295 SSR markers across the whole genome in order to adequately represent the original large mapping population.
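The selection idea can be sketched as a greedy pass over lines ranked by crossover count, keeping any line that contributes a breakpoint not yet represented. This is a simplified illustration, not the exact procedure used here: it ranks by crossover number only (the number of valid markers and the exchange frequency rate are omitted), and the data layout and function names are assumptions.

```python
def breakpoints(genotypes):
    """Crossover intervals for one line.

    genotypes: dict mapping chromosome -> string of calls in map order,
    using 'A'/'B' for parental alleles and 'H'/'-' for heterozygous/missing.
    Returns a set of (chromosome, left_marker_index, right_marker_index).
    """
    found = set()
    for chrom, calls in genotypes.items():
        prev = None
        for i, g in enumerate(calls):
            if g not in ("A", "B"):      # skip heterozygous or missing calls
                continue
            if prev is not None and calls[prev] != g:
                found.add((chrom, prev, i))
            prev = i
    return found

def select_core_set(lines):
    """Greedy selection of lines that add previously unseen breakpoints."""
    bps = {name: breakpoints(g) for name, g in lines.items()}
    # Lines with more crossovers are considered first; names break ties.
    order = sorted(lines, key=lambda n: (-len(bps[n]), n))
    covered, core = set(), []
    for name in order:
        new = bps[name] - covered
        if new:
            core.append(name)
            covered |= new
    return core, covered

toy = {
    "RIL1": {"chr1": "AABB"},
    "RIL2": {"chr1": "AABB"},   # same breakpoint as RIL1, so redundant
    "RIL3": {"chr1": "ABBB"},
    "RIL4": {"chr1": "AAAA"},   # no crossover
}
core, covered = select_core_set(toy)
print(core, len(covered))
```

In the toy data, RIL1 and RIL3 together carry every breakpoint observed in the four lines, so they form the core set.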

Mapping population evaluation

A total of 1,029 polymorphic markers segregated in this population, including 532 SNP markers from the Universal 1,536 Soy Linkage Panel (Hyten et al. 2010) and 497 SSR markers (349 public SSRs and 148 new SSR and Indel markers developed in this research), and were used to evaluate the 376 core set RILs. The RILs can be classified into three categories according to their genotypes: families with equal parental allele frequencies, families with a dominant maternal allele frequency (p > 0.5), and families with a dominant paternal allele frequency (q > 0.5). Based on the defining characteristic of a RIL population, an equal distribution of the two parental alleles, a set of statistical tests, including equality, symmetry, and representativeness tests, was used to examine the agreement and deviation of the core set population relative to the theoretical RIL population (Gai et al. 2007). The equality test examined whether the ratio of the total numbers of the two parental alleles fit the expected theoretical value of 1:1. The symmetry test examined the fit to a 1:1 ratio of the number of families with maternal allele frequency >0.5 to the number of families with paternal allele frequency >0.5. The representativeness test examined whether each family, as well as the whole experimental sample, was a random sample from the corresponding theoretical population, by comparing the rates of families with extreme-biased segregation or of markers with extreme-biased distortion with the expected rates obtained from the simulated populations (Gai et al. 2007). The results of the tests are shown in Table 1. The equality test showed a larger χ2c value, indicating unequal genetic contributions from the two parents. However, the symmetry test showed that the different categories of families in the population were balanced, since the χ2c values were <3.84 (the 5% significance threshold with one degree of freedom), indicating that the numbers of families basically fit a 1:1 ratio. The representativeness test of markers showed that some markers had serious distortion, so that the extreme-biased rate was larger than the expected value of the simulated population; but the rate of extreme-biased families was 40.4%, lower than the expected values of the simulated population in scenario 1 (45.4%) and scenario 2 (41.0%).
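The 1:1 tests above reduce to a one-degree-of-freedom chi-squared statistic on two counts. The sketch below assumes that the subscript c in χ2c denotes a continuity (Yates) correction; that reading, the function name, and the toy counts are assumptions, and the formulation of Gai et al. (2007) may differ in detail.

```python
def chi2_one_to_one(n_a, n_b, yates=True):
    """Chi-squared statistic (1 df) for a 1:1 expectation on two counts.

    yates=True applies the continuity correction (assumed meaning of the
    subscript c); hypothetical helper for illustration.
    """
    expected = (n_a + n_b) / 2
    diff = abs(n_a - expected)
    if yates:
        diff = max(diff - 0.5, 0.0)
    # Both classes deviate from the expectation by the same amount.
    return 2 * diff ** 2 / expected

THRESHOLD_5PCT = 3.84  # critical value for 1 df, as quoted in the text

# Toy counts, not data from this study.
stat = chi2_one_to_one(60, 40)
print(round(stat, 2), "fits 1:1" if stat < THRESHOLD_5PCT else "deviates from 1:1")
```

The same function applies to the equality test (total counts of the two parental alleles) and to the symmetry test (counts of families with p > 0.5 versus q > 0.5).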

Table 1 A comparison of core set and original population for equality, symmetry and representativeness tests

Framework map construction

Based on the theoretical requirements for genetic population analysis and the three tests mentioned above, we removed 16 families with the highest χ2c or with more than 20% heterozygous alleles and 18 markers with the most extreme-biased distortion or with more than 40% missing data. Finally, a total of 986 markers, which included 471 SSRs and 515 SNPs, were used to construct a framework map in which a total of 145 newly developed SSR or Indel markers were integrated (Additional File 2). Except for chromosomes 5 and 16, which each had two unmerged linkage groups, the other 18 chromosomes each constituted one LG (Table 2). Thus, the 986 markers were assembled into 22 LGs with a total genetic map length of 2,723.8 cM (Fig. 2).
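The two threshold-based filters (families with more than 20% heterozygous alleles, markers with more than 40% missing data) can be sketched as below. The χ2c-based and distortion-based removals are not covered. The data layout, the function name, and the choice to compute heterozygosity over non-missing calls and missingness over the retained families are assumptions.

```python
def filter_matrix(calls, max_het=0.20, max_missing=0.40):
    """Drop families and markers exceeding the stated thresholds.

    calls: dict family -> dict marker -> 'A', 'B', 'H' (heterozygous) or '-' (missing).
    Hypothetical helper for illustration.
    """
    kept_lines = {}
    for line, row in calls.items():
        scored = [g for g in row.values() if g != "-"]
        # Heterozygosity over non-missing calls; families with no calls are dropped.
        het = sum(g == "H" for g in scored) / len(scored) if scored else 1.0
        if het <= max_het:
            kept_lines[line] = row

    markers = {m for row in calls.values() for m in row}
    kept_markers = set()
    for m in markers:
        column = [row.get(m, "-") for row in kept_lines.values()]
        if column and column.count("-") / len(column) <= max_missing:
            kept_markers.add(m)

    return {line: {m: g for m, g in row.items() if m in kept_markers}
            for line, row in kept_lines.items()}

toy = {
    "F1": {"m1": "A", "m2": "B", "m3": "-"},
    "F2": {"m1": "H", "m2": "H", "m3": "-"},   # fully heterozygous, removed
    "F3": {"m1": "B", "m2": "A", "m3": "A"},
}
print(filter_matrix(toy))
```

In the toy matrix, family F2 is removed for excess heterozygosity and marker m3 for excess missing data.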

Table 2 Summary of the genetic map constructed with 990 markers using a core set of RILs selected from a large Forrest × Williams 82
Fig. 2

Genetic linkage map of soybean constructed with SSR, Indel and SNP markers. The linkage map was visualized graphically with MapChart

Discussion

Mapping population

A limitation in QTL mapping is that markers closely linked to genes controlling important traits are difficult to identify because the number of meiotic recombination events along chromosomes is limited in a regular mapping population, which prevents the map from reaching a high resolution. Even though, in theory, a larger mapping population yields a genetic map with better resolution, in practice it is not feasible to construct a high-density genetic map using a relatively large number of plants because of genotyping cost and mapping efficiency. Usually, a smaller set of RILs in a mapping population is initially used in QTL mapping to identify the genomic regions associated with targeted traits. Once the associated regions are identified, further mapping or map-based cloning can be pursued to identify markers more closely linked to genes. The major disadvantage of mapping genes with a smaller number of individuals is that the associations detected in a small population are sometimes not representative (Yan et al. 2006), so the QTL cannot always be confirmed using a larger number of individuals. To overcome the disadvantages of using a small population, we developed a large RIL population derived from a cross of the Forrest and Williams 82 cultivars, which represent southern and northern US soybean germplasm, respectively. We modified the ‘selective strategy’ proposed by Vision et al. (2000) to select an optimal subset (core set) of the population consisting of 376 RILs from the original 1,025 F2 Forrest × Williams 82 families. Using an optimal core set of RILs will reduce the cost of genotyping a large mapping population and increase the mapping efficiency.

The selective mapping strategy has been widely used to improve the efficiency of mapping genetic markers (Vision et al. 2000; Howad et al. 2005; Han et al. 2009; Sargent et al. 2008). It consists of a two-step process in which, first, a subset of highly informative plants is selected based on a framework genetic map of a mapping population, and second, new markers are added to this map using the subset population. The method used in this study to select the core set population was initially used for mapping AFLP markers into an RFLP and SSR map to construct a high-density map (Wu et al. 2001), and for anchoring chromosome-specific markers to a Leymus AFLP map (Wu et al. 2003). Following the same approach, we selected the core set of individuals according to the most breakpoints and unique breakpoints instead of the bin size proposed by Vision et al. (2000). Thus, we can capture all informative individuals carrying all breakpoints detected by the framework map, and the core set of individuals is a good representation of the whole mapping population.

Marker development

Integration of the Forrest physical map with the Williams 82 genome assembly via a high-density genetic map is a very important step for utilizing the valuable genomic resources developed in Forrest (Lightfoot 2008). BAC library screening has been used for integration of physical and genetic maps in soybean (Wu et al. 2008). However, this strategy is time-consuming and can generate false positives because of duplication in the soybean genome. In contrast, genetic markers derived from BESs can anchor the corresponding BAC clones or contigs onto a genetic map without BAC library screening. Therefore, development of BES-based markers is a promising tool for constructing integrated physical and genetic maps. Although BAC-end sequence-based SSR markers have been successfully used to develop genetic maps in cotton (Frelichowski et al. 2006), soybean (Shultz et al. 2007; Shoemaker et al. 2008), and apple (Han et al. 2009), the polymorphism between two parental lines in a given mapping population is very low, 13–15% (Shultz et al. 2007; Shoemaker et al. 2008), which prevents genetic mapping of a larger number of BES-based SSR markers in one mapping population. In this study, we took advantage of the available Williams 82 genome sequence to identify potentially polymorphic SSRs and Indels derived from BESs of the minimum tiling path (MTP) of the Forrest physical map. The polymorphism of the 175 putative polymorphic SSRs or Indels was 71.8% (Additional File 2). Therefore, this approach has dramatically improved the efficiency of polymorphic marker identification and mapping in the Forrest × Williams 82 population. Moreover, selection of the SSR or Indel markers for mapping was more specific.
The unmerged linkage groups in this study indicated that a big gap existed in chromosomes 5 and 16, respectively, that may reflect either a lack of polymorphic markers in a highly homozygous region or the presence of hot spots of recombination that enlarge the genetic distance corresponding to a short physical distance (Hwang et al. 2009). The new polymorphic markers that could be mapped into gaps in the genetic map were preferentially selected to fill gaps and reduce the tendency of marker clustering. In addition to the identified SSR and Indel markers derived from Forrest BESs, we also predicted SNPs from Forrest BESs by comparing them with the Williams 82 genome sequence. Moreover, using high-throughput Solexa sequencing technology, we have discovered thousands of SNPs between Forrest and Williams 82 (unpublished). These SNPs will be mapped to this population using customized high-throughput SNP genotyping arrays.

Map integration

The Williams 82 soybean physical map has been integrated with the genetic map via BAC clone screening with known genetic markers (Wu et al. 2008) or by genetic mapping of SSRs derived from BESs of the Williams 82 BAC library (Shoemaker et al. 2008). The precise placement of BAC contigs was hindered by multiple hits to FPC contigs caused by soybean genome duplication or by false positives. To build a reliable integrated physical/genetic map, a high-resolution genetic map is needed. Markers developed from the BESs of BAC clones that make up the BAC FPC contigs (i.e. the physical map) would be especially useful for map integration. Because the Williams 82 genome sequence is available, the BESs themselves can help place some of the BAC contigs on the Williams 82 genome sequence. Because only the BESs of the MTP were available in the Forrest physical map, the power of anchoring BAC clones by sequence alignment was not as great as for the Williams 82 BACs, in which all BAC ends were sequenced. To obtain high-quality integrated physical maps of Forrest and Williams 82, all BAC ends should be sequenced, which would provide multiple anchors for any FPC contig and more sequences for SSR, Indel or SNP marker identification.

In summary, we selected a core set of a Forrest × Williams 82 population that was used for building a high-resolution genetic map. A framework map was constructed to demonstrate the integration of Forrest physical map with the Williams 82 maps. The results showed that the Forrest BACs indeed could fill gaps or help anchor the unanchored sequence scaffolds to improve the quality and coverage of Williams 82 genome sequence assembly.