Genome-wide characterization and selection of expressed sequence tag simple sequence repeat primers for optimized marker distribution and reliability in peach
Simple sequence repeats (SSRs) in Prunus expressed sequence tags (ESTs) were mined, and flanking primers were designed and used for genome-wide characterization and selection of primers to optimize marker distribution and reliability in peach. A total of 4,770 and 9,029 SSRs were identified from 12,618 contigs and 34,238 singlets, from which 3,695 and 6,849 primers were designed, respectively. Alignment of the 10,544 forward and reverse primer sequences (21,088 queries) against the peach reference genome at an e value cutoff of 9e-03 resulted in 23,553 hits (96,621 alignments) with 16,885 queries, and “no hits found” (NHF) for the remaining 4,203 queries. A majority of aligned primers had only one hit/alignment on the peach scaffolds, and the distribution of the 5,500 singly aligned primers (pairs) on each 500-kb genome interval was determined. The average number of EST-SSR primers per 500-kb interval was 10.8. The primers were categorized into eight subgroups based on the difference between the genomic amplicon size and expressed amplicon size of each primer, with 288 primers of optimized distribution and reliability selected for genotype evaluation. Only 2 of the 288 primers failed in all 4 peach cultivars screened, with an overall successful primer/sample rate of 97.2 %. The average number of alleles detected in the four cultivars was 3.84. The polymorphism information content (PIC) values suggested that a majority of the 288 primers had a high rate of allele polymorphism among the four peach cultivars. The advantages of genome-wide analysis of EST-SSR primers and options to improve the polymorphism rate are discussed.
Keywords: Microsatellite; Short tandem repeat (STR); Marker-assisted selection (MAS); Variety authentication; Reference genome
DNA markers and methodologies have changed in the last 20 years, including the range of marker types, attributes, popularity, development approaches, detection technologies, and throughputs. DNA markers have allowed many molecular studies to make important advances in genetics, taxonomy, ecology, and evolution (Agarwal et al. 2008). In the pregenome era, widely used DNA markers include restriction fragment length polymorphisms (RFLPs), random amplified polymorphic DNAs (RAPDs), cleaved amplified polymorphic sequences (CAPS), and amplified fragment length polymorphisms (AFLPs); in the postgenome era, sequence-based codominant markers such as simple sequence repeats (SSRs, also called microsatellite or short tandem repeats—STRs) and single-nucleotide polymorphisms (SNPs) prevailed partly due to concurrent development of massive genomic sequences and computational capabilities along with the initiation and accomplishment of many genome programs (McCarthy 1993; Thiel et al. 2003; Chen et al. 2006; Tang et al. 2006; Horner et al. 2010; Chen and Gmitter 2013; International Peach Genome et al. 2013). Most SSRs are developed from publicly available gene-derived expressed sequence tag (EST) sequences (Thiel et al. 2003; Chen et al. 2006; Kayesh et al. 2013; Miah et al. 2013). An assessment of plant genomes suggested a significantly higher SSR frequency in the low-copy transcribed regions compared with other regions of the genome (Morgante et al. 2002). Generally, SNP genotyping is performed at high throughput and requires expensive proprietary instruments and allele analysis/calling programs (Chen and Sullivan 2003), which is not economically practical for marker applications on a routine and budget-constrained basis. SSR genotyping is more affordable and suitable for many studies because of its throughput and detection flexibility (Chen et al. 2008; Miah et al. 2013). 
However, the distribution and performance (detectability) of randomly selected SSRs are generally unknown, which can result not only in unevenly distributed primers, but also many primer failures in linkage analysis or other SSR marker applications (Chen et al. 2008; Kayesh et al. 2013; Miah et al. 2013). Furthermore, the allelic heterozygosity and the polymorphism rate of randomly selected EST-SSR primers tend to be very low (Chen et al. 2008), which negatively impacts the power of gene mapping. The heterozygosity (H) and polymorphism information content (PIC) values are the most widely used indices and measures to evaluate and predict whether a genetic marker will be informative among cultivars or strains (Terwilliger et al. 1992; Pettersson et al. 1995; Ott and Rabinowitz 1997; Liu and Muse 2005). The failure of these particular SSR primers and an unpredicted, relatively high rate of homozygosity and nonpolymorphism in successfully detected primers are major factors precluding SSR markers from being efficiently used in genetic mapping or other studies relying on allelic heterozygosity and polymorphisms. With many reference genomes now available, genome-wide characterization of EST-SSR primers likely offers a solution to the issue or at least enables optimal selection of primers with predicted distribution, a lower risk of primer failure, and a higher polymorphism rate, compared to a random selection of SSRs (Chen et al. 2008; Kayesh et al. 2013).
Little attention has been given to failed and/or poorly performing EST-SSR primers due to the difficulty (time and cost) in addressing the unknown causes on a primer-by-primer basis and the lack of interest (i.e., lack of value) in reporting them. The success of an EST-SSR primer depends on the amplification process and instrument-dependent detectability of the products. In other words, the failure of a primer can be caused by a failed amplification or, if amplification is successful, by a failed detection. A recent sequence analysis of 340 failed and successful EST-SSR primers on a reference genome revealed several genomic factors affecting the primer performance and polymorphism in the studied genomes. The main causes of primer failure were, first, the forward and reverse primers being positioned too far apart in the target genome to form the expected amplicons, owing to sequencing/assembly errors in EST contigs or in the reference genome; second, introns being too long to allow the genomic amplicon to be detectable (and/or reliably amplified); third, multiple full and partial primer alignments likely caused by paralogs; and fourth, failed full alignment of contig-derived primer sequences containing discrepant nucleotides (Chen et al. 2014). Therefore, based on their alignment on a reference genome, primers can be effectively distinguished and assigned into different reliability categories; meanwhile, their distribution in the genome is also determined. This information provides clear guidance for selecting well-distributed primers from only the highly reliable categories, whether the interest is genome-wide or localized.
Different types of DNA markers have been used to study various aspects of peach (Prunus persica L. Stokes) and other Prunus species. RAPDs were used for genetic linkage mapping (Warburton et al. 1996), AFLPs were used for the diagnosis and mapping of peach tree short life (PTSL) syndrome (Blenda et al. 2006, 2007), SSRs were used for genome and trait mapping (Bliss et al. 2002; Howad et al. 2005; Ogundiwin et al. 2009; Lambert and Pascal 2011), and a 9k SNP array was developed that has potential for many marker applications (Verde et al. 2012). Since the first set of peach SSRs was developed (Cipriani et al. 1999), the number of SSRs on peach genetic maps has ranged from 4 (Bliss et al. 2002), to 21 (Lambert et al. 2004), and to 264 collected from various sources (including EST-SSRs) and mapped by bin mapping with an almond × peach F2 population (Howad et al. 2005). The increasing number of Prunus ESTs available in the Genome Database for Rosaceae (GDR) allows identification of substantially more EST-SSRs (Jung et al. 2004, 2008, 2014). However, without further genome-wide characterization of these EST-SSR primers, it would be impossible to select core sets of primers with optimal distribution, reliability, and polymorphism. In this study, Prunus EST-SSR primers were mined and aligned onto the peach reference genome (International Peach Genome et al. 2013) to determine their genome distribution, to predict genomic amplicon sizes, and to characterize the genomic features in these amplicons. Following these results, a core set of primers with optimal distribution, reliability, and polymorphism was selected as candidates for potential use in various marker applications in peach and other Prunus species.
Materials and methods
Prunus ESTs and varieties
A total of 118,965 unique Prunus ESTs were retrieved from the GDR. Peach had the most ESTs (81,200), with the remainder from apricot (Prunus armeniaca L.), sweet cherry (Prunus avium L.), and eight other species or biotypes of Prunus (Electronic Supplementary Material [ESM] Table 1). The FASTA header of each sequence was simplified to its accession ID, to which a unique two-letter prefix derived from the species name was added to track the genotype source. Four peach cultivars of different origin and characteristics (Okie 1998), “Chinese Cling,” “Blazeprince,” “Helen Borchers,” and “Heath Cling,” were used to screen the selected microsatellites. Genomic DNAs were isolated from 5 g of tender, young leaves using a CTAB protocol slightly modified from the method previously described (Doyle and Doyle 1987).
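The header simplification described above can be sketched as follows. This is only an illustration: it assumes the accession ID is the first whitespace-delimited token of the header, and the function names, the prefix "PP", and the accession "ABC123.1" are hypothetical rather than taken from the study.

```python
def simplify_header(header: str, prefix: str) -> str:
    """Reduce a FASTA header to '>' + species prefix + accession ID.

    Assumption: the accession is the first whitespace-delimited token.
    """
    accession = header.lstrip(">").split()[0]
    return f">{prefix}{accession}"


def relabel_fasta(lines, prefix):
    """Yield FASTA lines with every header simplified; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield simplify_header(line, prefix) if line.startswith(">") else line


# Hypothetical record, for illustration only
record = [">ABC123.1 some cDNA clone description", "ACGTACGT"]
print(list(relabel_fasta(record, "PP")))
```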
Bioinformatics of EST-SSR mining and primer modeling
Bioinformatics programs were installed in Linux CentOS. All the sequences were combined into a single file and assembled under 95 % similarity using CAP3 (Huang and Madan 1999). All contigs and singlets were used for microsatellite motif identification using MISA (Thiel et al. 2003), with primer design achieved using Primer3 (Rozen and Skaletsky 2000). The paired numbers representing microsatellite motif length and minimum repeat number in the MISA configuration file were modified to 2-6 3-4 4-3 5-3 6-3 (mononucleotide repeats were excluded), and the maximum interval between any adjacent microsatellites remained 100 bp (Thiel et al. 2003). The primers were designed with an optimal length of 24 bp and with expected PCR products of 100–300 bp (Rozen and Skaletsky 2000). The primer results were saved in a tab-delimited text file (ESM Table 2) and imported into MS Excel for subsequent analysis and summary.
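To make the motif-length/minimum-repeat settings concrete, the following is a minimal regular-expression sketch of perfect SSR detection under the 2-6 3-4 4-3 5-3 6-3 thresholds. It is not the MISA code itself: the function names are hypothetical, compound SSRs (the 100-bp interval rule) are not handled, and the filter that rejects motifs built from a shorter unit is an assumption about the intended behavior.

```python
import re

# unit length -> minimum number of repeats (mononucleotide repeats excluded)
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT' or 'AA')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq: str):
    """Return sorted (1-based start, motif, repeat count) tuples for perfect SSRs."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # one captured unit followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                found.append((m.start() + 1, motif, len(m.group(0)) // unit_len))
    return sorted(found)


# Toy sequence: (AT)7 starting at position 4 and (AGC)4 starting at position 22
toy = "GGC" + "AT" * 7 + "CCGT" + "AGC" * 4 + "TT"
print(find_ssrs(toy))
```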
Primer sequence alignment with the peach reference genome
The peach reference genome sequence version 1.0 (International Peach Genome et al. 2013) was retrieved from the GDR (Jung et al. 2004, 2008, 2014) and formatted into a database by formatdb in BLAST (Altschul et al. 1997). The microsatellite primer sequences were converted into FASTA format for BLASTN against the reference genome sequence. The BLAST cutoff e value was set at 9e-03 (0.009), which allowed any alignment of 19+ contiguous nucleotides, or of 22+ nucleotides with only 1 discrepancy, to be saved in the BLAST output file. Primers with “no hits found” (NHF) at the cutoff e value were run through BLAST again without an e value restraint to determine any possible alignments on the reference genome and thus gain an additional assessment of these primers. The SSR primer sequences mapped in Prunus and available in the GDR were also formatted and included in the BLAST run. The aligned position information was used to avoid duplicated selection of primers flanking the same SSR motifs and loci, i.e., to ensure selection of only new EST-SSR primers at unmapped loci.
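The per-query counting of hits and alignments, and the identification of NHF queries, can be sketched as below. The sketch assumes BLAST results in the standard 12-column tabular layout (query ID, subject ID, percent identity, length, mismatches, gap opens, query start/end, subject start/end, e value, bit score); the output format actually used in the study is not stated, and the function name and primer IDs are hypothetical.

```python
from collections import defaultdict


def summarize_blast(tabular_lines, query_ids, max_evalue=9e-3):
    """Return {query: (number of distinct subject hits, number of alignments)}.

    Queries that appear in query_ids but have no alignment at or below
    max_evalue are reported as (0, 0), i.e. "no hits found" (NHF).
    """
    subjects = defaultdict(set)
    alignments = defaultdict(int)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        qid, sid, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            subjects[qid].add(sid)
            alignments[qid] += 1
    return {q: (len(subjects[q]), alignments[q]) for q in query_ids}


# Made-up rows for illustration only
rows = [
    "p1_F\tscaffold_1\t100.00\t24\t0\t0\t1\t24\t1000\t1023\t1e-06\t48.1",
    "p1_F\tscaffold_1\t100.00\t20\t0\t0\t1\t20\t9000\t9019\t2e-04\t40.1",
    "p1_F\tscaffold_3\t100.00\t19\t0\t0\t1\t19\t500\t518\t8e-04\t38.2",
    "p1_R\tscaffold_1\t100.00\t14\t0\t0\t1\t14\t1400\t1387\t0.5\t28.2",
]
summary = summarize_blast(rows, ["p1_F", "p1_R", "p2_F"])
nhf = [q for q, (hits, _) in summary.items() if hits == 0]
print(summary, nhf)
```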
The genomic amplicon size (GAS) of each primer was calculated as the difference between the maximum and minimum of the four start and end alignment positions of the forward and reverse primers on the reference genome. The difference between each GAS and the corresponding predicted EST amplicon size (EAS) was then calculated. A difference of 0 simply suggested that the genomic and expressed amplicons had the same sequence length; a nonzero difference tentatively represented the size of any intron(s) in the genomic sequence or allelic deletions/insertions at the locus, depending on the presumed cutoff minimum intron length (Wendel et al. 2002). No GAS could be calculated when either the F or the R primer was NHF, since no primer or only one primer was aligned. If a GAS was not in an acceptable amplicon size range (e.g., over several hundred kilobases) and there were multiple alignments of either or both primers on different scaffolds, a search was performed to find whether there were paired forward and reverse alignments on the same scaffold, and the primers were moved to the first such location for the calculation. Forward and reverse primers aligned on two different scaffolds or with excessively long GASs (e.g., >=10,000 kb) on the same scaffold were tentatively categorized into an “error” subgroup. The distribution of these microsatellite loci on the reference genome was determined based on the start alignment positions of all F primers. A total of 288 new primers (ESM Table 3) evenly distributed on all 8 scaffolds, about 1–2 primers in every 0.5–1 Mb genome interval, were selected for subsequent genotyping validation.
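The GAS calculation and the interval-based distribution count can be illustrated with a short sketch. It follows the plain max-minus-min subtraction described above (whether one base should be added for inclusive coordinates is not stated), and the function names and coordinates are hypothetical.

```python
from collections import Counter


def genomic_amplicon_size(f_start, f_end, r_start, r_end):
    """GAS as the span between the outermost of the four alignment positions."""
    positions = (f_start, f_end, r_start, r_end)
    return max(positions) - min(positions)


def gas_minus_eas(gas, eas):
    """0 suggests equal genomic/expressed length; nonzero suggests intron(s) or indels."""
    return gas - eas


def primers_per_interval(f_starts, interval=500_000):
    """Count primers per (scaffold, interval index) from forward-primer start positions."""
    return Counter((scaffold, pos // interval) for scaffold, pos in f_starts)


# Made-up coordinates: forward primer at 1000-1023, reverse primer at 1400-1377
gas = genomic_amplicon_size(1000, 1023, 1400, 1377)
print(gas, gas_minus_eas(gas, eas=250))
print(primers_per_interval([("scaffold_1", 120_000), ("scaffold_1", 510_000)]))
```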
Two additional selection criteria were also used: (1) the GAS of these primers had to be 80 to 480 bp (preferentially 100 to 300 bp) so as to fit the size range of ensured detectability in widely used fluorescence/capillary or polyacrylamide gel (PAG)-based platforms; (2) if available, primers whose GAS contained presumed allelic deletions/insertions or introns were preferred so as to potentially maximize the polymorphism rate. Primers with no difference between EAS and GAS were selected only if there was no other primer choice. Other primers, including those with NHF, error, or oversized intron/GAS, were excluded. The selected primers will have useful applications in genetic studies of Prunus in the future.
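One way to express these criteria for the candidates falling in a single genome interval is sketched below. The relative priority of the two preferences (a nonzero GAS/EAS difference first, then the 100 to 300 bp range) is an assumption, as are the function name and the record layout.

```python
def select_primers(candidates, per_interval=2):
    """Pick primer names for one genome interval.

    candidates: list of dicts with 'name', 'gas' (None if no GAS) and 'eas'.
    Primers outside 80-480 bp or without a GAS are excluded; the rest are
    ranked so that a nonzero GAS/EAS difference comes first and, within
    that, a GAS of 100-300 bp comes first (assumed priority order).
    """
    usable = [c for c in candidates if c["gas"] is not None and 80 <= c["gas"] <= 480]
    usable.sort(key=lambda c: (c["gas"] == c["eas"], not 100 <= c["gas"] <= 300))
    return [c["name"] for c in usable[:per_interval]]


# Made-up candidates for illustration only
pool = [
    {"name": "a", "gas": 250, "eas": 250},    # no GAS/EAS difference
    {"name": "b", "gas": 350, "eas": 200},    # difference, outside preferred range
    {"name": "c", "gas": 220, "eas": 180},    # difference, preferred range
    {"name": "d", "gas": 1200, "eas": 200},   # oversized, excluded
    {"name": "e", "gas": None, "eas": 200},   # no GAS (NHF), excluded
]
print(select_primers(pool))
```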
Microsatellite genotyping and polymorphism validation
Microsatellite genotyping was performed as previously described (Chen et al. 2006). The M13 forward primer sequence (GTT GTA AAA CGA CGG CCA GT) was added as a common tail to the 5′ end of all microsatellite forward primers (Oetting et al. 1995). The tagged forward primers and their nontagged reverse primers were synthesized by Eurofins MWG Operon Technologies (Huntsville, AL). All the 288 forward and reverse primers were stored in six 96-well plates for high-throughput screening and genotyping. For easy identification and use of individual primers from the plates in future applications, the 288 primers were named by their well positions prefixed with “CX” and the plate number (1, 2, or 3); for example, CX1A01 is the primer at the A01 position of Plates 1F and 1R, CX3H12 is the primer at H12 of Plates 3F and 3R, and so on (ESM Table 3). Four fluorescently labeled M13 tags with 6FAM, VIC, NED, and PET labels were synthesized by Life Technologies (Carlsbad, CA). PCR was performed in a C1000 Touch Thermal Cycler with a CFX384 block module (Bio-Rad, Hercules, CA) in a 5-μl volume consisting of 1× PCR buffer, 0.2 mM dNTPs, 2 mM MgCl2, 0.3 μM of the forward and reverse primers, 0.05 μM dye-labeled M13 tagged forward primer, 0.5 U Taq DNA polymerase (BioExpress, Kaysville, UT), and ~10 ng DNA template. A touchdown PCR program was run with an initial step of 94 °C for 3 min, followed by 10 cycles of denaturation at 94 °C for 30 s, annealing at 61 °C for 30 s with a 0.5 °C decrement each cycle, and extension at 72 °C for 45 s, followed by 30 more cycles with a constant annealing temperature of 56 °C (other parameters were the same), plus a final extension at 72 °C for 15 min. The dye-labeled PCR products were genotyped on a 3100xl Genetic Analyzer (Life Technologies, Carlsbad, CA). GeneMarker 2.4 (SoftGenetics, State College, PA) was used to analyze the chromatographic trace files and generate the microsatellite allele table.
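The naming scheme, the M13 tailing, and the touchdown annealing schedule can be written out explicitly as a sketch. The function names are hypothetical, and the row-by-row ordering of wells is an assumption; only the name pattern, the tail sequence, and the temperatures come from the text above.

```python
M13_TAIL = "GTTGTAAAACGACGGCCAGT"  # M13 forward sequence quoted above, spaces removed


def primer_names(plates=(1, 2, 3), rows="ABCDEFGH", cols=range(1, 13)):
    """Names such as CX1A01 ... CX3H12 (row-by-row well order is assumed)."""
    return [f"CX{p}{r}{c:02d}" for p in plates for r in rows for c in cols]


def add_m13_tail(forward_primer: str) -> str:
    """Prepend the common M13 tail to the 5' end of a forward primer."""
    return M13_TAIL + forward_primer.upper()


def touchdown_annealing(start=61.0, step=0.5, touchdown_cycles=10,
                        final=56.0, final_cycles=30):
    """Annealing temperature (degrees C) for each of the 40 cycles."""
    temps = [start - step * i for i in range(touchdown_cycles)]
    return temps + [final] * final_cycles


names = primer_names()
print(len(names), names[0], names[-1])
print(touchdown_annealing()[:11])
```

Under these settings the last touchdown cycle anneals at 56.5 °C, so the program steps down smoothly into the constant 56 °C phase.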
Genotyping data analysis
The allele table was converted to the format required by PowerMarker (Liu and Muse 2005) and imported to the program to calculate the number of alleles detected, the H value, PIC value, and the gene diversity value of each marker among the four peach cultivars, which were used to evaluate and predict the informativeness and usability of the primers.
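For readers without access to PowerMarker, the indices can be approximated from an allele table with the textbook definitions: gene diversity as 1 − Σp_i², PIC as gene diversity minus Σ_{i<j} 2p_i²p_j², and H as the observed proportion of heterozygous samples. This is a minimal sketch; PowerMarker's own estimators may include corrections not shown here, and the function names and allele sizes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """genotypes: list of (allele1, allele2) tuples, one per diploid sample."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in freqs.values())


def pic(freqs):
    """Polymorphism information content from allele frequencies."""
    p = list(freqs.values())
    return gene_diversity(freqs) - sum(2 * a * a * b * b for a, b in combinations(p, 2))


def observed_heterozygosity(genotypes):
    """Proportion of samples carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


# Made-up allele sizes (bp) for one marker in four samples
calls = [(150, 150), (150, 154), (154, 158), (150, 158)]
freqs = allele_frequencies(calls)
print(len(freqs), observed_heterozygosity(calls), gene_diversity(freqs), pic(freqs))
```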
In silico identification of SSRs in Prunus ESTs
[Table: EST-SSR mining and primer modeling summary. Rows: total number of sequences examined; total size of examined sequences (bp); total number of identified SSRs; number of sequences containing SSRs; number of sequences containing more than one SSR; number of SSRs present in compound formation; number of records created by Primer3; number of sequences with successful primer modeling; number of sequences with failed primer modeling.]
Genomic distribution based on primer alignment on the peach reference genome
[Table: Primer count based on hits/alignments and aligned peach genome scaffolds. Labels: number of hits; number of alignments; F/R pairs; 0 (no hits found); scaffold length (bp).]
Genomic features identified by primer alignment on the peach reference genome
Genotyping evaluation of optimally selected EST-SSR markers
Advantages of EST-SSR primer alignment on a reference genome
EST-SSR primers have been widely used in many applications (Chen et al. 2008; Miah et al. 2013). Recently, especially when a relatively large number of primers are needed, the primers have most often been randomly selected from the large pools of EST-SSR primers developed through sequence mining (Thiel et al. 2003; Chen et al. 2006). Random selection of EST-SSRs results in primers and markers of unknown status with respect to their distribution in the genome, their amplification and detection performance, and their polymorphism rate. As a consequence, a high rate of failure in amplification and/or detection, and frequently a low polymorphism rate and/or a skewed genome distribution, are observed (Chen et al. 2008). A recent sequence analysis of 340 EST-SSR primers on a reference genome revealed that the main causes of failed primers were forward and reverse primers positioned too far apart to form the expected amplicons because of sequencing/assembly errors in EST contigs or the reference genome; introns too long to allow the genomic amplicon to be detectable (and/or reliably amplified); multiple full and partial primer alignments, most likely caused by paralogs; and failed full alignment of contig-derived primer sequences containing discrepant nucleotides (Chen et al. 2014). By aligning all the Prunus EST-SSR primers on the peach reference genome, genomic information on these primers can be readily gained and used to categorize the primers into subgroups, which helps guide the establishment of suitable criteria for choosing certain primers and avoiding those with undesirable traits. For example, knowing the positions of all primers in the genome allows informed selection of primers either in particular regions or across the entire genome.
Primers that fall within the error subgroup or that have an oversized, unreliable, or undetected GAS can be readily excluded, whereas under random selection such primers would be included and would almost certainly fail. The 97.2 % overall amplification/detection success rate in this study supports the advantage of selecting EST-SSR primers based on their alignment to a reference genome. It is worth noting that most of the forward and reverse primers in the error subgroup were aligned on different scaffolds, prompting speculation that paralogous genes with high identities in the genome, rather than true sequencing/assembly errors, might be responsible for a large proportion of the error subgroup. Further investigation is needed to clarify this speculation.
Options to improve the EST-SSR heterozygosity rate
Allelic heterozygosity and polymorphism among genotypes of interest are essential for linkage mapping and other segregation-based studies and are also more informative for other marker applications including marker-assisted selection (MAS), cultivar authentication, pedigree determination, phylogenetic analysis, and association genetics (Terwilliger et al. 1992; Ott and Rabinowitz 1997; Liu and Muse 2005). Although the primer failure rate was minimized by the approach used in this study, the homozygosity rate remained relatively high based on the H values obtained. The next goal must be to minimize the markers lacking heterozygosity and/or polymorphism among genotypes of interest. Some efforts have been taken to mine polymorphic EST-SSRs among highly redundant ESTs (Kong et al. 2012; Kayesh et al. 2013; Mohanty et al. 2013), which at least minimized selection of nonpolymorphic primers and likely increased the heterozygosity rate. One genotyping comparison indicated that the tri-, penta-, and compound types of EST-SSRs appeared to have much higher heterozygosity rates compared with the other EST-SSRs with even-numbered repeat units (Chen et al. 2008). Further comparison of heterozygosity and polymorphism rates among these EST-SSR markers with presumed deletions/insertions, introns, and other detectable subgroups might provide a new means to increase these rates. Specifically, genomic variation in introns that have little impact on gene structure and function may be particularly valuable for detection of more polymorphisms in intron-containing amplicons and for subsequent genetic analysis as well (Muthamilarasan et al. 2013).
The authors thank Bryan Blackburn, Luke Quick, and Minling Zhang for their technical assistance. The research is partially supported by the USDA National Program of Plant Genetic Resources, Genomics and Genetic Improvement (Project number 6606-21000-004-006) and a USDA National Institute of Food and Agriculture Specialty Crop Research Initiative project (2009-51181-06036).
Data archiving statement
All Prunus EST sequences and accession numbers are available at the National Center for Biotechnology Information EST database (http://www.ncbi.nlm.nih.gov/nucest/?term=Prunus). The peach (Prunus persica) reference genome assembly (version 1.0) is available at the Genome Database for Rosaceae (http://www.rosaceae.org/species/Prunus_persica/genome_v1.0), as is the mined Prunus EST-SSR primer information (http://www.rosaceae.org/node/336118). The 10,544 EST-SSR forward and reverse primers and the selected 288 primers are attached as ESM Tables 2 and 3, respectively.
- Blenda AV, Wechter WP, Reighard GL, Baird WV, Abbott AG (2006) Development and characterisation of diagnostic AFLP markers in Prunus persica for its response to peach tree short life syndrome. J Hortic Sci Biotechnol 81:281–288
- Chen X, Sullivan PF (2003) Single nucleotide polymorphism genotyping: biochemistry, protocol, cost and throughput. Pharmacogenomics J 3:77–96
- Chen C, Gmitter FG Jr (2013) Mining of haplotype-based expressed sequence tag single nucleotide polymorphisms in citrus. BMC Genomics 14:746
- Chen C, Bock CH, Beckman TG (2014) Sequence analysis reveals genomic factors affecting EST-SSR primer performance and polymorphism. Mol Genet Genomics. doi:10.1007/s00438-014-0875-8
- Doyle JJ, Doyle JL (1987) A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull 19:11–15
- International Peach Genome I, Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, Falchi R, Lucas S, Mica E, Maldonado J, Lazzari B, Bielenberg D, Pirona R, Miculan M, Barakat A, Testolin R, Stella A, Tartarini S, Tonutti P, Arus P, Orellana A, Wells C, Main D, Vizzotto G, Silva H, Salamini F, Schmutz J, Morgante M, Rokhsar DS (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45:487–494
- Jung S, Staton M, Lee T, Blenda A, Svancara R, Abbott A, Main D (2008) GDR (Genome Database for Rosaceae): integrated web-database for Rosaceae genomics and genetics data. Nucleic Acids Res 36:D1034–D1040
- Muthamilarasan M, Venkata Suresh B, Pandey G, Kumari K, Parida SK, Prasad M (2013) Development of 5123 intron-length polymorphic markers for large-scale genotyping applications in foxtail millet. DNA Res 21:41–52
- Okie WR (1998) Handbook of peach and nectarine varieties: performance in the Southeastern United States and index of names. The National Technical Information Service, Springfield, VA
- Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol (Clifton, NJ) 132:365–386
- Verde I, Bassil N, Scalabrin S, Gilmore B, Lawley CT, Gasic K, Micheletti D, Rosyara UR, Cattonaro F, Vendramin E, Main D, Aramini V, Blas AL, Mockler TC, Bryant DW, Wilhelm L, Troggio M, Sosinski B, Aranzana MJ, Arus P, Iezzoni A, Morgante M, Peace C (2012) Development and evaluation of a 9K SNP array for peach by internationally coordinated SNP detection and validation in breeding germplasm. PLoS One 7:e35668