Background

Peanut or groundnut (Arachis hypogaea L.), a member of the legume family, is an important oil, food, and feed crop cultivated in more than 100 countries. The annual planting area of peanut is approximately 21.8 Mha worldwide, with an annual production of 38.6 Mt (http://faostat.fao.org/faostat/collections?subset=agriculture 2011). Peanut production and consumption in China account for approximately 40 % of the worldwide totals, and peanut exports from China have accounted for more than 30 % of global trade since 2001 (http://zzys.agri.gov.cn/nongqing.aspx). Over 50 % of the peanuts produced in China are crushed for oil extraction, and peanut oil accounts for 25 % of total domestic vegetable oil, second only to rapeseed oil. Given the importance of the crop, substantial efforts have been made in recent years to develop various types of molecular markers, such as restriction fragment length polymorphisms (RFLPs) [1, 2], random amplified polymorphic DNAs (RAPDs) [3–5], amplified fragment length polymorphisms (AFLPs) [6, 7], simple sequence repeats (SSRs) [8, 9], insertions/deletions (INDELs) [10], and single nucleotide polymorphisms (SNPs) [11, 12]. These markers were developed for genetic linkage mapping [13, 14], genetic diversity studies [9, 15, 16], and plant breeding programs [10, 17]. Despite the efforts of several research groups around the world, genetic research and molecular breeding of peanut lag behind those of other crops, such as rice, wheat and rape. The lack of ideal molecular markers and genomic resources is an important factor hampering the development of genetic research and molecular breeding of peanut.

Single-locus markers have many advantages in molecular genetics and breeding studies compared with multi-locus markers [18–20]. The alleles of single-locus markers can be assigned to particular genomic loci in diversity analyses, avoiding the problems that multi-locus markers cause in polyploids because of extensive genome duplication and homology within and between genomes [21, 22]. A series of diversity parameters, such as the number of alleles, allele frequency and polymorphism information content (PIC), can be calculated more accurately for single-locus markers than for multi-locus markers [22]. Markers with only a single locus yield accurate genotypes and are more suitable for the subsequent analysis of population structure and linkage disequilibrium (LD), whereas the genotyping of multi-locus markers is often ambiguous, which increases errors and makes these analyses difficult.

Among the various types of molecular markers, SSRs have become the most widely used in genetic maps, gene mapping and marker-assisted selection (MAS) because of their relative abundance, good reproducibility, highly polymorphic nature, codominant inheritance pattern and random distribution in the genome [23, 24]. Based on their locations in the genome, SSR markers are generally divided into genomic SSRs and genic SSRs (or expressed sequence tag (EST)-SSRs) [25]. The usual protocol for the development of genomic SSRs has been the generation of a small-insert genomic library, subsequent hybridization with probes, and the sequencing of candidate clones [26–28]. This process is costly, technically complex, time-consuming, and labor-intensive. The development of next-generation sequencing (NGS) technologies capable of quickly and inexpensively producing millions of short (50–150 bp) DNA sequence reads has prompted the use of sequence information for the identification of SSR markers [29–31]. At present, the SSR markers developed using NGS technology are often genic SSRs based on transcriptome assemblies [32–34]. However, genic SSRs are derived from coding regions that are usually conserved, leading to lower polymorphism in comparison with genomic SSRs. The development of polymorphic genic SSRs therefore requires more experimental screening, which increases the cost of primer synthesis and wastes resources and time. For species without a reference genome sequence, sequencing a combination of libraries and assembling the DNA sequences through genome survey sequencing may represent an effective approach to developing markers, including single-locus SSR markers [35].
Combining libraries with genomic DNA inserts of different sizes, generated by randomly breaking long DNA molecules, may provide not only more complete coverage of the genome but also the information necessary for genome assembly [36, 37], because the fragments are randomly positioned on the source DNA and a majority of them overlap. The development of genomic markers using this method has many advantages: it is high-throughput and fast, and it results in relatively high polymorphism rates. Markers derived from de novo DNA assemblies can also exhibit improved accuracy and avoid some of the amplification failures that occur with markers derived from transcriptome assemblies, in which primers located across splice junctions have binding sites that are separated by introns in genomic DNA.

Peanut is an allotetraploid (2n = 4x = 40, AABB) with a large genome (~2.7 Gbp). Because of the lack of genomic information, much effort in recent years has still been focused on developing markers for peanut genetics [12, 14, 26, 38–45], with very little development of single-locus markers. In the process of constructing a consensus genetic map from the markers mapped in ten RIL and one BC mapping populations, a set of 58 single-dose SSR markers, which consistently amplified only one locus in the A or B sub-genome, was used to identify the sub-genomic origin of each linkage group, and 879 markers were eventually integrated into the map [46]. Zhou et al. [11] constructed a SNP-based linkage map for which the SNP markers were developed using the uniqueness of read mapping to the consensus sequences as a filtering criterion. Consequently, the SNP markers on this map represent single loci in the AABB genome. Many existing SSR markers for the allotetraploid peanut are multi-locus because of polyploidy, and the multiple amplified fragments or loci may introduce many problems in population genetic studies. Single-locus markers can effectively avoid these issues. Therefore, it is attractive to develop genomic single-locus SSR markers in A. hypogaea for better application in genetic and breeding studies.

Here, four libraries were constructed and sequenced on the HiSeq 4000 platform. A de novo assembly of the DNA sequences was employed to specifically develop single-locus SSR markers in a genome-wide survey. A total of 134,652 single-locus SSR markers were developed, and their characteristics were analyzed. To validate the developed single-locus markers, some of them were evaluated by PCR-based amplification of twelve cultivated accessions, one F2 mapping population and one natural population.

Results

DNA sequencing and de novo genome assembly

The libraries with insert sizes of 270 bp, 500 bp, 2 Kbp and 5 Kbp were sequenced on an Illumina HiSeq 4000 platform (Table 1). Massively parallel Solexa sequencing of the combination of libraries generated ~308 Gbp of raw data containing 2,056,876,970 paired-end reads, with each read being ~150 bp in length. After filtering and correction of the sequence data, a total of ~237 Gbp of clean data were obtained, comprising 1,675,631,984 high-quality reads and representing approximately 87.8× coverage of the estimated 2.7 Gbp genome (Table 1).
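The coverage figure follows directly from the reported totals. The short sketch below only re-derives it from the rounded values quoted in the text (237 Gbp of clean data, 2.7 Gbp estimated genome size) and is not part of the authors' pipeline; the variable names are illustrative.

```python
# Rounded values quoted in the text; variable names are illustrative only.
clean_bases_gbp = 237.0            # clean data after filtering (approximate)
genome_size_gbp = 2.7              # estimated peanut genome size
raw_reads = 2_056_876_970          # paired-end reads before filtering
clean_reads = 1_675_631_984        # high-quality reads after filtering

# Sequencing depth = total clean bases / genome size.
coverage = clean_bases_gbp / genome_size_gbp

# Fraction of reads that survived filtering and correction.
retained_fraction = clean_reads / raw_reads

print(f"{coverage:.1f}x coverage, {retained_fraction:.1%} of reads retained")
```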

Table 1 Summary of sequencing data

The program SOAPdenovo and all of the clean reads were used to generate a de novo assembly. This assembly included 2,102,446 contigs with an N50 of 1782 bp (Table 2). The majority of the contigs were in the range of 201–1000 bp (57.1 % of the contigs), and the longest contig length was 310,739 bp (Table 2). For scaffold assembly, only scaffolds greater than 200 bp in length were further analyzed. A total of 1,176,527 scaffolds were generated corresponding to 2.0 Gbp with an N50 length of 3,920 bp (Table 2). The length of the scaffolds varied from 200 bp to 576,627 bp, with an average of 1,693 bp; 360,557 scaffolds were longer than 2 Kbp and 9,448 scaffolds were longer than 10 Kbp (Table 2). The assembled genome size was ~2.0 Gbp, covering 73.6 % of the estimated 2.7 Gbp genome size. The GC content of the de novo assembled genome was 38.1 %.
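For readers less familiar with the N50 statistic reported above, the following minimal sketch shows how it is conventionally computed: sequences are sorted from longest to shortest, and the N50 is the length of the sequence at which the running total first reaches half of the assembled bases. The scaffold lengths in the example are hypothetical and chosen only for illustration; they are not taken from Table 2.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    The N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in bp (illustration only).
toy_scaffolds = [200, 450, 900, 1500, 3920, 5000]
toy_n50 = n50(toy_scaffolds)
print(toy_n50)
```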

Table 2 Statistics of de novo assembly results

Development and characterization of genome-wide single-locus SSR markers

The development of single-locus SSRs was based on all of the sequences from the 2.0 Gbp de novo genome assembly. We identified SSR motifs using the PERL5 script MIcroSAtellite [47] and designed primer pairs from the flanking sequences of the identified motifs using Primer3 software [48, 49]. We then aligned the primer pairs to the assembled scaffolds and retained as single-locus SSRs those whose primer pairs matched only one location. Ultimately, 375,180 SSRs were found, of which 134,652 (35.89 %) were identified as single-locus SSRs (Additional file 1: Table S1). The frequency was 67.7 single-locus SSRs per Mb, or one single-locus SSR per 14.8 Kbp. The ratio of single-locus SSRs from genic regions to those from intergenic regions was 11.2 % (13,511/121,141), and the corresponding ratio for non-selected SSRs was 14.6 % (47,735/327,441).
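The proportions quoted in this paragraph can be re-derived from the reported counts. The sketch below is only an arithmetic check using the numbers given in the text; the variable names are illustrative.

```python
# Counts reported in the text.
total_ssrs = 375_180
single_locus_ssrs = 134_652
single_genic, single_intergenic = 13_511, 121_141
all_genic, all_intergenic = 47_735, 327_441

# Share of all identified SSRs that are single-locus.
single_locus_pct = 100 * single_locus_ssrs / total_ssrs

# Genic-to-intergenic ratios for single-locus and non-selected SSRs.
single_ratio_pct = 100 * single_genic / single_intergenic
all_ratio_pct = 100 * all_genic / all_intergenic

# Mean spacing implied by the reported density of 67.7 SSRs per Mb.
kbp_per_ssr = 1000 / 67.7

print(f"{single_locus_pct:.2f} % single-locus")
print(f"{single_ratio_pct:.1f} % vs {all_ratio_pct:.1f} % genic/intergenic")
print(f"one single-locus SSR per {kbp_per_ssr:.1f} Kbp")
```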

For all of the developed genomic single-locus SSR markers, a total of 155,665 motifs were found that were classified as mono- to hexanucleotide repeat types (the compound repeats were divided into corresponding nucleotide repeat types) (Table 3). The motif repeat number ranged from 3 to 146, and the repeat length was an average of 17.5 bp (Table 3). The trinucleotide repeat was the most abundant motif type, with 42,233 markers, accounting for 27.1 % of the total developed single-locus SSRs. The tetranucleotide motif also occurred at a high frequency of 26.5 %. The hexanucleotide motif had the lowest frequency of 3.9 % (Table 3). The investigation of nucleotide composition characteristics showed that A (95.1 %), AT (54.0 %), AAT (33.9 %), AAAT (37.7 %), AAAAT (29.3 %) and AAAAAT (13.7 %) were the most common motifs corresponding with the mono- to hexanucleotide repeats, respectively, suggesting that the SSRs have a tendency to be A/T rich in the peanut (Table 3). For each motif type, motif abundance decreased as the motif repeat number increased (Fig. 1). The slowest rate of decrease was for the dinucleotide motifs, and the fastest rate was for the hexanucleotide motifs (Fig. 1).

Table 3 The distribution of different types of single-locus SSRs identified
Fig. 1

Motif frequency distributions of mono- to hexanucleotide motif types with different repeat numbers (from 3 to >20) in the de novo assembled genomic sequences of A. hypogaea

Validation and polymorphism detection of single-locus SSR markers in twelve inbred lines

To test whether the in silico developed SSR markers are single-locus, 1,790 SSR markers were selected to amplify the genomic DNA of 12 inbred lines (Additional file 2: Table S2). A total of 1,687 markers produced clear fragments, of which 1,637 (97.0 %) displayed a single amplicon, 32 (1.9 %) displayed two amplicons, and 18 (1.1 %) displayed three or more amplicons (Table 4). Of the 1,637 putative single-locus SSR markers, 290 (17.7 %) showed polymorphisms (Table 4, Additional file 2: Table S2).

Table 4 Amplification patterns of the 1,790 developed SSR markers in the 12 inbred lines

We also investigated whether the motif type, repeat length and repeat number influence the polymorphism rate of single-locus SSR markers. As shown in Fig. 2a, the highest polymorphism rate was observed for the dinucleotide motifs (36.8 %), with compound motifs also showing a high rate of 31.5 %, followed by mono- (16.7 %), tri- (12.0 %), tetra- (4.3 %) and pentanucleotide motifs (4.0 %), while the lowest rate was observed for hexanucleotide motifs (1.5 %). This tendency shows that the rate of polymorphism decreases as the motif length increases, with the exception of mononucleotide motifs. No obvious relationship between the polymorphism rate and repeat length was found. Further investigations of the polymorphism rate and repeat number revealed that the maximum polymorphism rate of the developed SSR markers was 46.4 %, corresponding to a repeat number of 11. When the repeat number was less than 11, a basic trend was that the polymorphism rate tended to increase as the motif repeat number increased (Fig. 2b).

Fig. 2

Relationship of the polymorphism rates of putative single-locus markers to the motif type (a) and repeat number (b)

Evaluation of inheritance and assignment of single-locus SSR markers to the linkage map

To confirm whether the developed markers amplifying a single amplicon are truly inherited in a single-locus mode, as well as to assign them to the Arachis linkage map, 101 high-quality markers that produced only single amplicons in the twelve inbred lines and also showed polymorphism between Zhonghua10 and ICG12625 were surveyed in the F2 population derived from these two parents. Of the 101 markers, 97 (96.0 %) segregated in the F2 population in accordance with the Mendelian inheritance law for single loci (1: 2: 1, P < 0.01); thus, these markers were considered true single-locus markers. Because segregation distortion is a common biological phenomenon in genetic analyses of hybrid segregating populations [50–52], the 4 distorted markers (AHGA331177, AHGA193642, AHGA75014, AHGA84019) will be further tested for possible single-locus nature in subsequent research.
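A test of 1:2:1 segregation of this kind is usually a chi-square goodness-of-fit test with two degrees of freedom. The sketch below illustrates that standard test under two assumptions that are not stated in the text: that the reported 0.01 is the significance level, so a marker is treated as fitting when the test does not reject the 1:2:1 ratio, and that no continuity correction is applied. The genotype counts are hypothetical, chosen to sum to the 154 F2 plants.

```python
import math


def chi_square_1_2_1(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test against a 1:2:1 F2 ratio.

    Returns (statistic, p_value). With 2 degrees of freedom the
    chi-square survival function is exactly exp(-x / 2).
    """
    total = n_aa + n_ab + n_bb
    expected = (total / 4, total / 2, total / 4)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.exp(-stat / 2)


ALPHA = 0.01  # assumed significance level

# Hypothetical genotype counts (parent 1 type, heterozygote, parent 2 type).
fit_stat, fit_p = chi_square_1_2_1(40, 76, 38)
skew_stat, skew_p = chi_square_1_2_1(60, 74, 20)

print(f"balanced marker: chi2={fit_stat:.3f}, fits 1:2:1: {fit_p >= ALPHA}")
print(f"skewed marker:   chi2={skew_stat:.3f}, fits 1:2:1: {skew_p >= ALPHA}")
```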

To assign these single-locus SSR markers to a linkage map, our previously published map for the F2 population derived from Zhonghua10 and ICG12625 was used as a basic frame [53]. We integrated the genotypes of these markers with previously published SSR markers. Finally, a linkage map showing the distribution of 504 SSR markers into 21 linkage groups was constructed, covering a distance of 1,504.31 cM (Fig. 3). A total of 87 (86.1 %) of the 101 single-locus SSR markers were integrated onto the map, of which 47 (54.0 %) were assigned to the A genome and 40 (46.0 %) to the B genome. The 87 single-locus markers were distributed among all of the linkage groups, with A04, at 10 single-locus markers, containing the largest number of the identified markers.

Fig. 3

Distribution of single-locus SSR markers on the genetic linkage map. The map was constructed using 154 F2 plants derived from Zhonghua 10 and ICG12625. The single-locus markers developed in this study are shown in boldface and are underlined. The markers are shown on the right side of the LGs, and the map distances are shown on the left side

Stability and universality of polymorphic single-locus SSR markers in A. hypogaea

To confirm whether the polymorphic single-locus markers tested in the 12 inbred lines are also stable and universal in more diverse lines, and to test the use of the markers in DNA fingerprinting and diversity analyses, we genotyped a natural population of 96 A. hypogaea accessions (Additional file 3: Table S3). A total of 100 markers were randomly selected from the polymorphic single-locus SSR markers tested in the 12 inbred lines to amplify the DNA of this natural population, including the 4 markers with skewed segregation in the above F2 population. A total of 95 markers displayed single alleles in more than 95 % of the lines, 3 displayed single alleles in 90 %–95 % of the lines, and 2 displayed single alleles in 80 %–90 % of the lines. Furthermore, the observed heterozygosity (Ho) value at each locus was calculated. The Ho values of the chosen SSR markers varied from 0 to 0.10 with a mean of 0.01, approaching 0 and consistent with the genomic characteristics of inbred lines (Table 5). The Ho value of 74 (74 %) loci was 0, indicating that these inbred lines were homozygous at these loci. The remaining 26 markers each detected very few heterozygous lines and had Ho values ranging from 0.01 to 0.10. Notably, the 4 markers that showed skewed segregation in the F2 mapping population all appeared as single alleles in more than 95 % of the lines, suggesting that they were also single loci. All of the selected markers appeared as single alleles in most of the A. hypogaea accessions, except for very few multi- or null loci, suggesting that the SSR markers have a universal single-locus nature in the peanut panel.

Table 5 The genetic diversity of 100 SSR markers revealed by 96 A. hypogaea accessions

To ascertain the potential value of the polymorphic single-locus markers in genetic studies, their genetic diversity in the 96 inbred lines was investigated. The 100 SSR markers generated 428 alleles (Table 5). The number of alleles varied from two to eighteen with a mean value of 4.28 per locus (Table 5). The PIC values of the 100 single-locus SSR markers varied from 0 to 0.86, with a mean of 0.33 (Table 5). The phylogenetic relationships of the 96 accessions were assessed using the 100 SSR markers by constructing a neighbor-joining tree (Additional file 4: Figure S1). At a similarity coefficient ≥ 0.81, the largest subgroup consisted of 39 accessions: 69.2 % were ssp. hypogaea (including 23 var. hypogaea and 4 var. hirsuta accessions), 15.4 % were ssp. fastigiata (including 5 var. vulgaris and 1 var. fastigiata accessions), and 15.4 % were of intermediate type (Additional file 4: Figure S1; Additional file 3: Table S3). The second-largest group included 31 accessions: 96.8 % were ssp. fastigiata (including 27 var. vulgaris and 3 var. fastigiata accessions), and 3.2 % were ssp. hypogaea (1 var. hypogaea accession) (Additional file 4: Figure S1; Additional file 3: Table S3). At a similarity value of 0.76, a small subgroup included 8 accessions, half of which were ssp. hypogaea and half ssp. fastigiata (Additional file 4: Figure S1; Additional file 3: Table S3). Despite a small number of discrepancies, our results indicate that the botanical varieties of the accessions in this study correspond well with the genetic distances, and thus the genetic relationships, among them.
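The diversity statistics used in this section (number of alleles, Ho and PIC) can be computed per locus from diploid genotype calls. The text does not state which PIC formula was used; the sketch below assumes the widely used definition PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², and the genotypes (allele sizes in bp) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes (allele pairs)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    n_alleles = sum(counts.values())
    return {allele: c / n_alleles for allele, c in counts.items()}


def pic(freqs):
    """Polymorphism information content (assumed standard definition)."""
    p = list(freqs.values())
    homozygosity = sum(x * x for x in p)
    cross_term = sum(2 * (a * a) * (b * b) for a, b in combinations(p, 2))
    return 1 - homozygosity - cross_term


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


# Hypothetical locus scored in 10 lines: two alleles (150 bp and 156 bp),
# with a single heterozygous line.
toy_genotypes = [(150, 150)] * 5 + [(156, 156)] * 4 + [(150, 156)]
toy_freqs = allele_frequencies(toy_genotypes)

print(len(toy_freqs), "alleles")
print("Ho  =", observed_heterozygosity(toy_genotypes))
print("PIC =", round(pic(toy_freqs), 4))
```

A monomorphic locus gives PIC = 0 under this definition, which matches the lower end of the range reported in Table 5.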

Discussion

SSRs are tandem repeats of short nucleotide motifs that vary in length and are spread throughout the genome. SSRs are highly versatile, PCR-based markers that are usually associated with a high frequency of length polymorphism; thus they have a wide range of applications in genetic research and molecular breeding. However, many studies have revealed that SSR markers often amplify multiple fragments from homologous DNA sequences because of the polyploid nature of many species [27, 54]. The multi-locus nature of SSR markers can complicate or cause errors in genotype scoring because of the reciprocal overlapping and uncertain allelism of these fragments [22]. Single-locus SSR markers can avoid this type of problem and are considered ideal markers for applications such as diversity analysis, variety identification and association analysis. Sets of high-quality single-locus SSR markers have previously been developed in plants such as potato, barley, rape, maize and grape [22, 55–58]. In our study, we developed 134,652 single-locus SSR markers for peanut. To our knowledge, this is the first report of the specific development of single-locus SSR markers in a genome-wide survey of A. hypogaea.

The combination of library sequencing and de novo assembly represents a fast and reliable approach for the generation of large datasets for peanut and also allows for the identification and development of single-locus SSRs through data mining. For assembly, combining libraries with different insert sizes can improve contig scaffolding much more effectively than increasing the physical coverage of a single-insert library [59]. We generated four libraries with different insert sizes, including two libraries produced with mate-pair sequencing and two short-insert libraries that were prepared in a separate experiment. The 150 bp reads from both ends of the fragments in the four libraries could overlap and generate elongated reads. Insert sizes of 2 Kbp and 5 Kbp were more efficient than the short-insert libraries (270 bp and 500 bp) because of their ability to bridge the longer and more abundant long interspersed nuclear elements (LINEs) and long terminal repeat (LTR) elements [37, 59]. The final assembly had a contig N50 value of 1,782 bp and a scaffold N50 value of 3,920 bp. The longest scaffold in the assembly was ~576.6 Kbp, and 360,557 scaffolds were longer than 2 Kbp (Table 2). The current assembly of the draft genome is 2.0 Gbp, covering 73.6 % of the estimated 2.7 Gbp total genome size. This is the first report of a de novo genomic assembly of A. hypogaea, and the assembly can be improved by the additional sequencing of larger-insert libraries to increase the contig and scaffold sizes. In addition, the data generated here will contribute to genomic research of peanut.

In our study, 134,652 single-locus SSR markers were identified from 375,180 SSRs. The ratio of single-locus SSRs from genic regions to those from intergenic regions (11.2 %) was lower than the corresponding ratio for non-selected SSRs (14.6 %). This is probably because peanut is an allotetraploid and genic regions are usually conserved, leading to high similarity of homoeologous genes or SSR flanking sequences in genic regions between the A and B subgenomes. We developed single-locus SSR markers using a single match of the primer pair to the assembled genome scaffolds as the identification criterion. Primer pairs in genic regions that matched both subgenomes because of homoeology between the A and B subgenomes were therefore filtered out in our analysis.

For the 134,652 single-locus SSR markers developed, we analyzed many important characteristics. Among all of the motif types, trinucleotide repeats were the most abundant, accounting for 27.1 % of the total markers. This result may have occurred because trinucleotide repeats correspond to whole codons and therefore do not cause frameshift mutations [60], and the prevalence of trinucleotide motifs [61] may suppress the other motif types, thus reducing the incidence of frameshift mutations caused by nontriplet repeats [62]. Interestingly, the dominant motifs among the mono- to hexanucleotide repeats (A, AT, AAT, AAAT, AAAAT and AAAAAT) were all A/T rich in peanut, which is similar to previous reports on species such as Brassica napus, rice, and Arabidopsis [54, 63, 64]. From Fig. 1, we observed that motifs with 3 and 4 repeats displayed the highest frequencies, 39.3 % and 24.98 %, respectively. The frequency of motifs with 5–10 repeats was 25.6 %, and motifs with > 20 repeats had a frequency of 5.69 %. Moretzsohn et al. [14] mined 271 SSR markers in the AA genome of Arachis and performed a similar analysis using a two-dimensional diagram. In that study, the criteria for SSRs were different: mono- and hexanucleotide SSRs were not included, 3- and 4-repeat motifs of di- to pentanucleotide SSRs were also not included, and the product size extended to 400 bp. Therefore, markers with 5–10-repeat motifs were the most frequent in that study, followed by those with > 20-repeat motifs, in contrast with the results of our survey.

Among the 1,637 selected markers that displayed a single amplicon in the twelve inbred lines, 290 (17.7 %) exhibited polymorphisms. In this study, dinucleotide motifs had higher rates of polymorphism than other repeat motifs, and the polymorphism rate of the single-locus SSR markers decreased as the motif length increased. In a genome-wide SSR characterization of cucumber (Cucumis sativus L.), similar results were observed: dinucleotides (47 %) were the most common polymorphic motif, followed by tri- (29.3 %), tetra- (12.4 %), penta- (4.5 %) and hexanucleotides (6.9 %) [65]. This result also corresponds to the SSR mutation rates of di-, tri-, and tetranucleotide repeats in the genome of D. melanogaster, in which tri- and tetranucleotide repeats were found to mutate at rates 6.4 and 8.4 times slower than that of dinucleotide repeats, respectively [66]. In addition, we found that the polymorphism rate of the single-locus SSRs increases with increasing repeat number. Similar results have been described for several plant species [54, 67–69]. In Brassica, genome-wide SSR characterization showed that the polymorphism rate of the tested SSR markers was highly positively correlated with the motif repeat number (r = 0.74) [54]. In carrot, SSR analysis revealed a similar trend between the polymorphism rate and the repeat number, and markers containing 11–15 repeat units displayed the highest polymorphism rates [67]. This relationship is understandable because a larger motif repeat number gives more opportunity for replication slippage events.

A single-locus SSR marker is defined by a pair of oligonucleotide primers with tandem repeats of short nucleotide motifs between them and can be used in a PCR assay to detect a unique site in the genome [22]. It is possible to identify these single-locus markers in DNA sequences using electronic PCR (e-PCR) by searching for subsequences of a query sequence that match the PCR primers and are in the correct order, orientation, and spacing to be consistent with the PCR product size [70, 71]. Here, using e-PCR, we identified a large number of single-locus SSRs based on the de novo assembled genomic sequences. Among 1,790 randomly selected in silico single-locus SSRs, 1,637 were successfully amplified with only one band. The results demonstrate the high efficacy of e-PCR for identifying unique SSR loci in peanut.

Single-locus markers are considered to have wide utility in linkage map construction and genetic analysis of crop species because of their uniqueness. In our study, 101 high-quality SSR markers showing polymorphisms between the parental lines Zhonghua10 and ICG12625 were surveyed as single-locus SSRs, and 87 were finally anchored in a peanut genetic map. Because these markers were located on specific chromosomes and exhibited the characteristics of co-dominance, polymorphism and stable amplification, they can serve as anchor markers in the construction of genetic maps, thereby helping with the integration of different linkage groups. Also, polymorphism screening performed using these newly developed SSRs will greatly increase the density of SSR markers in the peanut genetic map in the future. In addition, a panel of 96 accessions was used to verify that a subset of 100 SSRs showing polymorphism and one amplicon in each of the twelve lines were genuinely single-locus. These markers were further investigated for their potential use in genetic studies by ascertaining their genetic diversity in the natural population. The 100 single-locus markers generated 428 alleles with PIC values ranging from 0 to 0.86, with an average of 0.33. A set of 30 of the 100 single-locus SSR markers were highly informative, with PIC > 0.50 (Table 5). The informative markers will be very useful for accelerating molecular genetics and breeding studies in cultivated peanut. Peanut consists of two subspecies (ssp. hypogaea and ssp. fastigiata) and six botanical varieties (var. hypogaea, var. hirsuta, var. aequatoriana, var. peruviana, var. vulgaris, and var. fastigiata) that are classified based on the morphological traits of plants collected from the field [72].
Some accessions that did not belong to any of these six varieties according to morphological assessment were designated intermediate varieties, because these accessions were probably generated from hybridization between different varieties. In the phylogenetic analysis of the 96 peanut accessions, the vast majority of accessions (89 %) in the two largest groups were from China, and most of the exotic accessions (56.7 %) were not clustered in these two groups, suggesting that the genetic bases of Chinese and exotic accessions differ. There was only one accession of var. aequatoriana and no accession of var. peruviana among the material collected. To enlarge the genetic basis, more exotic accessions should be used in future peanut breeding programs.

In many crops, genome-wide patterns of genetic variation consistently exist among different accessions [73, 74]. Studies of seven wild relatives of soybean have revealed that approximately 80 % of the pan-genome is present in all accessions (core), whereas the rest shows greater variation than the core genome, perhaps reflecting a role in adaptation to diverse environments [37]. Analysis of resequencing data of six elite maize inbred lines has revealed more than 1,000,000 SNPs, 30,000 indel polymorphisms and 101 low-sequence-diversity chromosomal intervals in the maize genome [75]. In our study, we used the de novo assembled genomic sequences of Zhonghua 16 to design single-locus SSR markers, but a single genome does not adequately represent the diversity contained within a species. Although we used unique matching as the criterion for developing SSR markers, some markers amplified more than one locus in some accessions in the PCR-based experiment. Among the 1,790 markers tested, 1,637 amplified one locus in each of the 12 lines, and 50 amplified more than one locus in at least one line (Table 4). In the natural population, many SSR markers displayed more than a single allele in a small number of accessions. The cause of this phenomenon may be that these loci show homoeologous or heterozygous characteristics in the genomes of these accessions.

Conclusions

In this study, we developed single-locus SSR markers by sequencing a combination of libraries and generating a de novo assembly of the genomic sequences of A. hypogaea accession Zhonghua 16. Using an e-PCR approach, 134,652 single-locus SSRs were identified by aligning primer pairs against the assembled 2.0 Gbp sequences. The validation of a set of the developed markers in the twelve inbred lines, in a more diverse set of 96 accessions and in an F2 mapping population of 154 individuals shows the high accuracy of the developed single-locus markers. The genome-wide single-locus SSR markers developed in this study will provide a useful resource for molecular marker analyses, linkage map construction, QTL mapping, and molecular breeding.

Methods

Library preparation and Illumina sequencing

The inbred line Zhonghua 16 was selected on the basis of its agronomic importance and its status as a self-owned brand. The cultivar is widely grown in China and is early maturing, high-yielding and resistant to drought, lodging, late leaf spot disease and rust. Short-insert (270 bp and 500 bp) and mate-pair (2 Kbp and 5 Kbp) genomic DNA libraries of Zhonghua 16 were constructed. The libraries were sequenced on an Illumina HiSeq 4000 platform. Low-quality and contaminant sequences were trimmed using Trimmomatic 0.3 [76]. The following types of reads were filtered out: 1) reads with ≥10 % unidentified nucleotides (N); 2) reads with >10 nt aligned to the adaptor, allowing for ≤10 % mismatches; 3) reads with >50 % of bases having a phred quality of <5; and 4) putative PCR duplicates generated by PCR amplification during library construction.

De novo assembly

ErrorCorrection from SOAPdenovo [77] was used to connect the paired-end reads of the 270-bp library and to generate longer sequences for assembly. Reads from all libraries were used for contig building, and the 2 Kbp and 5 Kbp libraries were used to provide links for scaffold construction. GapCloser from SOAPdenovo [77] was used for gap filling within the assembled scaffolds using all paired-end reads. Finally, scaffold sequences that could be aligned to bacterial genomes with identity ≥95 % and e-value ≤1e-5 were filtered out. Potential protein-coding regions in the assembled sequences were identified using the gene prediction program Fgenesh [78].

In silico single-locus SSR development

In silico single-locus SSRs should not only have the characteristics of SSR markers but also be unique in the reference genome. For the identification of SSRs, the PERL5 script MIcroSAtellite (http://pgrc.ipk-gatersleben.de/misa/) [47] was used. The motif length was set to the default range of mono- to hexanucleotide, and the minimum repeat numbers for these motifs were set to 12, 6, 4, 3, 3 and 3, respectively. Primer pairs were designed from the flanking sequences of the identified SSRs using the primer3_core program (http://bioinfo.ut.ee/primer3/) [48, 49]. The primer design parameters were set as follows: primer length of 18–27 nucleotides, melting temperature of 55–65 °C, GC content of 30–70 %, and predicted PCR product length of 100–300 bp. To determine copy number, the primer pairs were aligned to the de novo assembled genome scaffolds of Zhonghua 16. This alignment was conducted using e-PCR [70] with the following default parameters: 2 bp mismatch, 1 bp gap, 50 bp margin and 50–1000 bp product size. SSR markers that hit only one locus in the de novo assembled genome were considered single-locus SSR markers. The developed SSR markers were designated as AHGA (Arachis hypogaea de novo genome assembly) markers.
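The two selection steps described above (SSR detection with motif-specific repeat thresholds, then retention of markers with exactly one e-PCR hit) can be sketched in Python as follows. This is a simplified stand-in and not the MISA or e-PCR implementation: the regular-expression search, the function names, and the marker IDs in the usage example are assumptions made for illustration.

```python
import re
from collections import Counter

# Minimum repeat number per motif length (mono- to hexanucleotide),
# taken from the thresholds stated in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_redundant(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif[:k] * (n // k) == motif
               for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_redundant(motif):
                continue  # reported under its shorter motif instead
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


def single_locus_markers(epcr_hits):
    """Keep marker IDs that occur exactly once in a list of e-PCR hits."""
    counts = Counter(epcr_hits)
    return sorted(marker for marker, n in counts.items() if n == 1)
```

For example, `single_locus_markers(["AHGA1", "AHGA2", "AHGA2", "AHGA3"])` keeps only the two hypothetical markers with a single hit, which mirrors the rule that a marker hitting more than one locus in the assembly is discarded.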

DNA isolation, PCR amplification and electrophoresis

Genomic DNA was extracted from tender leaves using a modified cetyltrimethylammonium bromide (CTAB) method, essentially as described by Grattapaglia and Sederoff (1994) [79]. PCR amplification was performed in a 10 μl reaction volume containing 15 ng DNA template, 2.5 μl 2× EcoTaq PCR SuperMix and 4 pM of each primer. Amplification was carried out on a T100 Thermal Cycler (Bio-Rad) using the following touchdown program: 95 °C for 5 min; 9 cycles of 95 °C for 30 s, 65 °C for 30 s and 72 °C for 45 s, with the annealing temperature reduced by 1 °C per cycle; 30 cycles of 95 °C for 30 s, 55 °C for 30 s and 72 °C for 45 s; and 72 °C for 5 min. The amplification products were separated by electrophoresis on 6 % denaturing polyacrylamide gels and visualized by silver staining according to Bassam [80].
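The touchdown profile can be written out step by step with a short Python sketch. It assumes that the first touchdown cycle anneals at 65 °C and that each later touchdown cycle is 1 °C lower; the function name and the tuple layout are hypothetical.

```python
def touchdown_program(start_anneal=65, touchdown_cycles=9, step=1,
                      final_anneal=55, final_cycles=30):
    """Return the PCR program as a list of (step, temperature_C, seconds)."""
    program = [("initial denaturation", 95, 300)]
    # Touchdown phase: annealing temperature drops by `step` each cycle.
    for cycle in range(touchdown_cycles):
        program.append(("denature", 95, 30))
        program.append(("anneal", start_anneal - cycle * step, 30))
        program.append(("extend", 72, 45))
    # Constant-temperature phase.
    for _ in range(final_cycles):
        program.append(("denature", 95, 30))
        program.append(("anneal", final_anneal, 30))
        program.append(("extend", 72, 45))
    program.append(("final extension", 72, 300))
    return program
```

Under these assumptions the annealing temperatures of the touchdown phase run from 65 °C down to 57 °C before the 30 cycles at 55 °C.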

Amplification pattern testing in 12 inbred lines, genetic localization and map construction of an F2 population

A total of 1,790 randomly selected SSR primers developed in this study were used to amplify the genomic DNA of twelve peanut inbred lines (Fuchuan, ICG6375, Zhonghua10, ICG12625, Yuanza9102, Xuzhou68–4, Zhonghua6, Xuhua13, Zhonghua5, ICGV86699, Chico, Jihua9331). These lines are the parents of six different mapping populations.

The parents ‘Zhonghua10’ and ‘ICG12625’ and 154 of their F2 progeny were used for genetic localization. Putative single-locus SSR markers that showed high amplification quality and polymorphism between Zhonghua10 and ICG12625 were selected. The chosen polymorphic markers were genotyped in the F2 individuals, and the allele patterns were investigated. Marker segregation was assessed with the χ² test to examine whether the markers segregated in the expected 1:2:1 ratio.
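The goodness-of-fit test for a 1:2:1 ratio can be sketched in Python as follows. The function name and the genotype counts in the usage note are hypothetical and are not data from this study. With three genotype classes the test has two degrees of freedom, for which the χ² upper-tail probability reduces to exp(−χ²/2), so no statistics library is needed.

```python
import math


def chi_square_1_2_1(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test of F2 genotype counts against 1:2:1.

    Returns (chi2, p_value) with 2 degrees of freedom.
    """
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("at least one genotyped individual is required")
    observed = (n_aa, n_ab, n_bb)
    expected = (total / 4, total / 2, total / 4)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value
```

For illustrative counts of 40, 76 and 38 among 154 individuals, the statistic is about 0.078 and the marker would not be flagged as distorted at the 0.05 level.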

For the linkage map construction, input datasets were constructed from the genotypes of 101 AHGA markers in 154 F2 lines and integrated with the genotypes of 497 SSR markers from our previous studies [53]. The program JoinMap 4.0 [81] was used to calculate the marker order and genetic distance and the Kosambi mapping function was employed for map length estimations. The recombination frequency was set at ≤ 0.45 and LOD scores at ≥ 2.0.
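For reference, the Kosambi mapping function converts a recombination frequency r into a map distance of 25 ln[(1 + 2r)/(1 − 2r)] centimorgans. The sketch below only illustrates this conversion; it is not how JoinMap is invoked, and the function name is hypothetical.

```python
import math


def kosambi_cm(r):
    """Map distance in centimorgans for recombination frequency r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination frequency must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))
```

At the recombination threshold of 0.45 used for grouping, this function gives roughly 73.6 cM, whereas r = 0.10 corresponds to about 10.1 cM.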

Validation of single-locus markers in a natural population

A subset of 100 polymorphic SSRs that produced one amplicon in each of the 12 inbred lines was randomly selected, and a panel of 96 accessions (provided by the National Medium-term Peanut Genebank of China) from China (66), India (24), America (5) and Zambia (1) was used for stability and diversity analyses. Population-based genetic statistics, including the number of alleles, observed heterozygosity (Ho) and PIC, were calculated using PowerMarker version 3.51 [82]. At a single locus, Ho was determined using the following equation:

$$ Ho=1-{\displaystyle \sum_{u=1}^n}{p}_{uu} $$

in which p_uu is the frequency of individuals homozygous for allele u, and n is the number of alleles. The PIC value of each SSR marker was calculated using the following formula:

$$ \mathrm{PIC}=1-\sum_{i=1}^{n}{p_i}^2-2\left[\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}{p_i}^2{p_j}^2\right] $$

in which p_i is the frequency of the ith allele and n is the number of alleles.
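The two statistics can be computed directly from diploid genotype calls, as in the following Python sketch. It is an illustration of the formulas and not a reimplementation of PowerMarker; the function names, the tuple representation of genotypes, and the allele sizes in the usage note are assumptions, and missing genotypes are assumed to have been removed beforehand.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Ho = 1 - sum of homozygote frequencies; genotypes are (allele, allele)."""
    homozygotes = sum(1 for a, b in genotypes if a == b)
    return 1 - homozygotes / len(genotypes)


def pic(genotypes):
    """Polymorphism information content from allele frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    sum_sq = sum(p * p for p in freqs)
    # Double sum over all unordered allele pairs i < j.
    cross = sum(freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs))
                for j in range(i + 1, len(freqs)))
    return 1 - sum_sq - 2 * cross
```

For four hypothetical individuals genotyped as (150, 150), (150, 154), (154, 154) and (150, 150), one of four is heterozygous, so Ho is 0.25; the allele frequencies are 0.625 and 0.375, which gives a PIC of about 0.359.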

Coefficients of genetic similarity for the 96 cultivated accessions used in this study were calculated using the SIMQUAL program of NTSYS-pc Version 2.10 [83]. A neighbor-joining tree was constructed based on the genetic similarity matrix with the SHAN clustering program [84, 85] of NTSYS-pc using the UPGMA algorithm.

Abbreviations

AFLP, amplified fragment length polymorphism; CTAB, cetyltrimethylammonium bromide; e-PCR, electronic PCR; Ho, observed heterozygosity; INDEL, insertion/deletion; LD, linkage disequilibrium; LINE, long interspersed nuclear element; LTR, long terminal repeat; MAS, marker-assisted selection; NGS, next-generation sequencing; PIC, polymorphism information content; RAPD, random amplified polymorphic DNA; RFLP, restriction fragment length polymorphism; SNP, single nucleotide polymorphism; SSR, simple sequence repeat