Introduction

Molecular marker variability, using isozyme genes (Byrne 1990) and simple sequence repeat (SSR) markers (Mnejja et al. 2010), has shown that peach is the least genetically variable of the Prunus crops, which also include apricot, cherry, Japanese plum, and almond. The gametophytic self-incompatibility system is functional in the other species but not operative in peach, which results in a high level of selfing in peach (Miller et al. 1989). Homozygosity is a consequence of selfing which, when coupled with selection for different agronomic characters and for progeny phenotypic uniformity, leads to erosion of genetic variability. In addition, the cultivars currently commercialized in Europe and America come from a very limited gene pool, used by the initial US breeders about one century ago (Scorza et al. 1985), resulting in a bottleneck that further diminished the level of variability.

A large set of peach cultivars from Europe and North-America has been analyzed with SSRs by Aranzana et al. (2003a; 2010). Despite having a level of variability sufficiently high for the individual identification of virtually all cultivars, these SSRs were found to be relatively less variable than in other species. The collection of cultivars studied was structured in subpopulations, generally corresponding to certain key commercial characters: peaches, nectarines and non-melting flesh (canning) peaches. High conservation of linkage disequilibrium has also been detected with a collection of 50 SSRs (Aranzana et al. 2010), as expected considering the bottleneck that occurred at the beginning of modern peach breeding.

DNA sequence variability was studied in a collection of 47 cultivars selected to be representative of the variability of the species on the basis of SSR variability using a set of 23 peach DNA sequences, RFLP genomic probes, and ESTs, of known position on the map (Dirlewanger et al. 2004; Illa et al. 2011) and genome (http://www.rosaceae.org/). Two of the EST sequences, corresponding to a pectate lyase and a sucrose synthase gene identified by Illa et al. (2011) as candidate genes for fruit texture and fruit glucose content, were studied at the whole sequence level. These results provide a first insight into the sequence variability of peach and allow us to study the variability of haplotypes in this species where high linkage disequilibrium (LD) conservation is expected.

Material and methods

Plant material

To evaluate the levels of sequence variability in commercial peaches, 47 peach varieties were selected from a collection of 224 previously analyzed with 50 SSR markers (Table 1). These varieties have been shown to be genetically distant and representative of different subpopulations (Aranzana et al. 2010). Genomic DNA was isolated from young leaves as previously described by Viruel et al. (1995).

Table 1 Characteristics of the 47 peach cultivars used

DNA sequencing

Genome-wide sequence variability

Among the sequences available in Prunus at the Genome database for Rosaceae (http://www.rosaceae.org/), we selected 40 regions sequenced in peach, evenly distributed along the Prunus reference map. Ten of them derived from RFLP genomic probes and the rest (30) from ESTs. Specific primer pairs (Table 2) were designed for each region using the Primer3 software (Rozen and Skaletsky 2000; http://frodo.wi.mit.edu/) to amplify fragments of about 450 bp, avoiding amplification of SSR regions.

Table 2 Description of the RFLP genomic probes and ESTs sequenced in 47 peach cultivars

The primers were first tested in two peach varieties, “Alexandra” and “Calante”, identified with SSRs as having high and low heterozygosity, respectively (Aranzana et al. 2010). For sequencing, 40 ng of peach genomic DNA were first amplified in a total volume of 20 μl with 1× PCR buffer, 1.5 mM MgCl2, 0.5 mM dNTPs, 0.25 μM of each primer, and 1.5 U of GoTaq® (Promega) using the following conditions: 2 min at 94°C; 35 cycles of 15 s at 94°C, 1 min at the appropriate annealing temperature, and 1 min at 72°C; followed by a final extension step of 5 min at 72°C. PCR products were purified using Sephadex™ G-50 (GE Healthcare Life Science) as described by Till et al. (2006). DNA quantity was measured using a spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and confirmed by electrophoresis on a 1 % TBE agarose gel. Forward primers were used for sequencing the fragments using the BigDye® Terminator Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA), according to the manufacturer's protocol, in an ABI Prism® 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). Sequences were visualized and manually edited with Sequencher 4.8 software (Gene Codes Corporation; Ann Arbor, MI, USA). Fragment ends were trimmed to remove low-quality sequence. Of the 40 regions tested, the 23 yielding high-quality, unique sequences were selected (Table 2) and sequenced in the 47 varieties.

To identify the coding and non-coding regions, RFLP sequences were blasted against the Populus genome database (http://www.populus.db.umu.se/) and the NCBI site (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Amplified EST sequences were also aligned against the FASTA sequences from which the primers were designed to detect intronic regions.

Sequence variability of two candidate genes

Two of the polymorphic fragments corresponded to the CGPAA2668 (pectate lyase 1, candidate for fruit firmness) and CGPPB6189 (sucrose synthase 1, candidate for sugar content) candidate genes (Illa et al. 2011) and were selected for whole gene sequencing. To amplify both genes, primers were designed by blasting the candidate gene fragments with the ESTs available in the GDR database (Jung et al. 2008). CGPAA2668 and CGPPB6189 were fully amplified in the 47 varieties with three and four primer pairs, respectively. The resulting amplified fragments were sequenced with four and seven primers (Table 3). Results were aligned with Sequencher 4.8 software (using the large gap algorithm) with additional manual adjustments in the case of long insertion/deletion (indel) polymorphisms.

Table 3 Primer pairs used for amplification of pectate lyase 1 (PpPL1; 2,239 bp of consensus sequence) and sucrose synthase 1 (PpSUS; 3,941 bp)

Sequence variability analysis

For each single nucleotide polymorphism (SNP), allelic and genotypic frequencies and observed and expected heterozygosity (Ho and He, respectively) were calculated, and deviation from Hardy–Weinberg equilibrium (HWE) was tested. Two of the polymorphic fragments contained more than one heterozygous SNP. In each of these fragments, the SNPs were in complete linkage, forming whole-fragment haplotypes, and consequently there was no phase ambiguity. Ho was also calculated for each cultivar. Additionally, for each polymorphic fragment, we calculated two estimates of nucleotide polymorphism: θW (Watterson 1975), based on the number of segregating sites, and the nucleotide diversity, π, i.e., the mean proportion of nucleotide differences among all pairwise comparisons (Nei 1987). To allow comparison between different regions, we estimated these parameters per site. Neutrality of the mutations was tested through Tajima's D statistic (Tajima 1989). These parameters were calculated with the software DnaSP v5 (Librado and Rozas 2009).
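The per-SNP statistics described above can be illustrated with a minimal Python sketch. The function name `snp_summary` and the genotype counts are hypothetical, and the one-degree-of-freedom chi-square test is an assumption chosen for brevity; it is not necessarily the test implemented in the software used in this study.

```python
def snp_summary(n_aa, n_ab, n_bb):
    """Summarize one biallelic SNP from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb                # number of diploid cultivars
    p = (2 * n_aa + n_ab) / (2 * n)       # frequency of allele A
    q = 1.0 - p                           # frequency of allele B
    ho = n_ab / n                         # observed heterozygosity
    he = 2 * p * q                        # expected heterozygosity under HWE
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Pearson chi-square comparing observed and HWE-expected genotype counts.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {"p": p, "q": q, "Ho": ho, "He": he, "chi2": chi2}


# Hypothetical genotype counts for 47 cultivars (not data from this study).
stats = snp_summary(25, 18, 4)

# 3.841 is the 5 % critical value of chi-square with 1 degree of freedom.
in_hwe = stats["chi2"] < 3.841
print(stats, in_hwe)
```

With small samples or rare alleles, an exact test would usually be preferred over the chi-square approximation sketched here.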

Variability comparison between SSRs and SNPs

To compare the variability detected by SSR and SNP markers, we selected the six SSRs closest to the fragments found to be polymorphic (Supplementary data, Table S1) as described in Aranzana et al. (2010).

HWE deviation of the SSR and SNP markers in the 47 cultivars was analyzed with GDA software (Lewis and Zaykin 2001). Additionally, two genetic distance matrices, one from SSR data and one from SNP data, were constructed for all the analyzed cultivars with the NTSYSpc v. 2.10t program (Rohlf 1994) as described in Aranzana et al. (2010). The two matrices were compared through a two-way Mantel test with the MxComp procedure of the same program.
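For readers unfamiliar with matrix comparison, the idea behind a Mantel test can be sketched in plain Python: correlate the corresponding off-diagonal entries of two distance matrices, then permute the row and column labels of one matrix to obtain a null distribution. This is only an illustrative, one-sided permutation version with hypothetical toy matrices; it is not the MxComp implementation, and the function names are assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabelling rows/columns by `order`."""
    pairs = combinations(range(len(matrix)), 2)
    return [matrix[order[i]][order[j]] for i, j in pairs]


def mantel(a, b, permutations=999, seed=0):
    """Return (r, one-sided permutation p-value) for two distance matrices."""
    identity = list(range(len(a)))
    xa = upper_triangle(a, identity)
    r_obs = pearson(xa, upper_triangle(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)                 # relabel the objects of matrix b
        if pearson(xa, upper_triangle(b, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical 4 x 4 symmetric distance matrices (not data from this study).
d_ssr = [[0, 2, 6, 7], [2, 0, 5, 8], [6, 5, 0, 3], [7, 8, 3, 0]]
d_snp = [[0, 1, 3, 4], [1, 0, 3, 5], [3, 3, 0, 1], [4, 5, 1, 0]]

r, p_value = mantel(d_ssr, d_snp)
r_same, _ = mantel(d_ssr, d_ssr)
print(r, p_value, r_same)
```

With 47 cultivars, each matrix contributes 47 × 46 / 2 = 1,081 pairwise distances to the correlation.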

The correlation between the heterozygosity levels detected with both types of markers was calculated with the JMP software package version 8.0.1 (SAS Institute Inc, Cary, NC) by the REML method.

Results and discussion

Sequence variability

In total, we sequenced 23 DNA regions in 47 cultivars, obtaining 8,379 bp/cultivar (i.e., 393,813 bp sequenced as a whole), 4,677 bp corresponding to non-coding regions and 3,702 bp to coding regions (Table 4). Nucleotide variation was observed in seven out of the 23 sequenced fragments (30 %), with 14 SNPs and two indels, corresponding to one SNP every 598 bp and one indel every 4,189 bp. As expected, variability in non-coding regions was higher than in coding regions (Ching et al. 2002; Lijavetzky et al. 2007; Micheletti et al. 2011), with one SNP every 390 non-coding bp versus one in 1,850 bp in coding DNA. According to these data, the proportion of fragments found with at least one polymorphism is much lower than that in other species also subjected to bottlenecks and strong selection, such as sugarcane where sequencing projects have found 86–94 % of the fragments (depending on the sample set) to be polymorphic (Bundock et al. 2009). Similarly, SNPs were observed at a lower density compared with other crops such as melon, tomato, grape, maize, or apple (Table 5). The observed low levels of sequence variability are consistent with those obtained using molecular markers such as AFLPs and SSRs (Aranzana et al. 2003b, 2010) in peach and with isozymes (Byrne 1990) and SSRs (Mnejja et al. 2010) in other Prunus species. Direct sequencing of genomic fragments (usually ESTs) as a tool for SNP discovery has been successfully used in different plant species; however, the low number of polymorphic fragments and SNP density found here implies that this method may be less efficient in peach, supporting the need for high-throughput sequencing strategies for this purpose.
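The densities quoted above follow from simple arithmetic, sketched below in Python. The split into 12 non-coding and 2 coding SNPs is taken from the fragment descriptions elsewhere in this section; the variable names are illustrative only.

```python
CULTIVARS = 47
BP_PER_CULTIVAR = 8379
NONCODING_BP = 4677
CODING_BP = 3702
SNPS = 14
INDELS = 2
CODING_SNPS = 2                           # both located in CGPPB6189
NONCODING_SNPS = SNPS - CODING_SNPS       # 12

total_bp = CULTIVARS * BP_PER_CULTIVAR    # bases sequenced over all cultivars
bp_per_snp = BP_PER_CULTIVAR / SNPS
bp_per_indel = BP_PER_CULTIVAR / INDELS
bp_per_noncoding_snp = NONCODING_BP / NONCODING_SNPS
bp_per_coding_snp = CODING_BP / CODING_SNPS

print(f"total sequenced: {total_bp} bp")
print(f"one SNP every {bp_per_snp:.1f} bp")
print(f"one indel every {bp_per_indel:.1f} bp")
print(f"non-coding: one SNP every {bp_per_noncoding_snp:.1f} bp")
print(f"coding: one SNP every {bp_per_coding_snp:.1f} bp")
```

The coding figure comes out at about 1,851 bp per SNP, in line with the approximate value given in the text.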

Table 4 Polymorphism of SNPs and indels found in a set of 47 peach cultivars
Table 5 Comparison of SNP variability in various plant species

All of the SNPs were found to be biallelic, 64 % due to transitions and 36 % to transversions. Although, probabilistically, the expected proportion between transitions and transversions is 1:2, a bias towards transitions is frequently observed, probably as a consequence of greater purifying selection against transversions (Keller et al. 2007) that may vary for different organisms (Strandberg and Salter 2004). The transition/transversion ratio observed here (1.77) is similar to that observed in grape (1.56 by Salmaso et al. 2004 and 1.46 by Lijavetzky et al. 2007) and potato (1.5 by Simko et al. 2006) and higher than that observed in apple (1.27 by Micheletti et al. 2011).
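The 1:2 expectation mentioned above can be checked by enumerating all possible base pairs. The following Python sketch (the function name `classify_snp` is hypothetical) classifies a biallelic SNP as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion, and counts both classes over the six possible allele pairs.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(allele1, allele2):
    """Classify a biallelic SNP as 'transition' or 'transversion'."""
    a, b = allele1.upper(), allele2.upper()
    if a == b or not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate the six unordered base pairs: two are transitions (A/G, C/T)
# and four are transversions, which gives the 1:2 expectation.
counts = {"transition": 0, "transversion": 0}
for a, b in combinations("ACGT", 2):
    counts[classify_snp(a, b)] += 1

print(counts)
```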

The number of polymorphic sites in polymorphic loci (including SNPs and indels) varied from one to six, with an average of 2.3 (Table 4). Five out of the seven polymorphic fragments contained only one polymorphism, all of them in non-coding DNA. In contrast, the two remaining loci (CGPPB6189 and AG112) were highly polymorphic, the former with five SNPs, two of them in coding DNA, and the latter with four SNPs and two indels (of 1 bp and 2 bp), all occurring in non-coding DNA. As sequences were obtained from PCR-amplified genomic DNA, each sequence contained the two DNA strands. This can produce phase ambiguity in the case of multiple polymorphisms per fragment, and if large indels occur in heterozygosis, base calling becomes unfeasible. However, in both loci, the homozygous genotypes showed that all SNPs were linked, yielding two haplotypes per fragment and, consequently, three genotypes, leaving no ambiguity for phase determination. Moreover, in the AG112 fragment, the cultivars carrying the less frequent haplotype in homozygosis were also homozygous for the two indels, suggesting that they were linked to the SNPs, so we can assume that the heterozygous cultivars for the SNP variants were also heterozygous for the same two indels observed in the homozygous genotypes.

Genetic variability, measured as θW, gave values ranging from 0.0003 to 0.0035 with an average of 0.00129 (Table 6, Fig. 1). Nucleotide diversity, π, depends, like θW, on the number of SNPs, but also on their frequencies. The π values were low, ranging from 1.7 × 10⁻⁴ (in CGPPC2807) to 6.8 × 10⁻³ (in CGPPB6189) with an average of 2.1 × 10⁻³; i.e., we expect two randomly chosen sequences of 1,000 bp selected from one of the polymorphic fragments to differ, on average, at about two sites.

Table 6 Variability parameters of seven polymorphic DNA sequences in peach
Fig. 1 Estimates of variability (π and θ) for the polymorphic fragments and the fully sequenced candidate genes PpPL1 and PpSUS

Observed mean θW values were similar to those reported for soybean (0.00097; Zhu et al. 2003) and about 3.5- and 7.5-fold lower than those observed in grapevine (0.0046; Lijavetzky et al. 2007) and maize (0.0096; Ching et al. 2002), respectively. When allele frequencies were taken into account, variability levels (measured as π) were more similar, with peach being a little over 1.5 times higher than soybean (π = 0.0012) and just two and three times lower than grapevine and maize (π = 0.0051 and 0.0063, respectively). This relative increase is a consequence of the relatively high allele frequencies observed. Here, SNP minor allele frequencies (MAF) ranged from 0.036 (in CGPPC2807) to 0.420 (in CGPPC7741) with a mean value of 0.225 (when considering only unlinked SNPs), and 64 % of them had a frequency higher than 0.2, while, for example, in grapevine, 50 % of alleles had MAF > 0.2 (Lijavetzky et al. 2007) (Fig. 2). This contrasts with allele frequencies observed in species more variable than peach, such as apple, with 26–42 % of alleles with MAF > 0.2 after re-sequencing two different sets of M. x domestica germplasm (Micheletti et al. 2011). Our results suggest that a large proportion of the SNPs discovered by sequencing occidental peach germplasm will be useful for association mapping purposes, where the MAF threshold is usually set at ≥5 %, as well as for inclusion in large-scale genotyping platforms, where only robust SNPs are desired.

Fig. 2 Comparison of SNP allele frequency distribution in peach and grapevine

Tajima's D statistic, which detects departure from neutrality of mutations by comparing θW and π estimates, gave values ranging between −0.778 and 2.078, with a mean of 1.06. Under neutral equilibrium, Tajima's D is expected to be zero. Significant departure from neutrality (p ≤ 0.05) was only observed in one of the fragments, CGPPB6189, with 2.078. This fragment amplifies part of a candidate gene that encodes a sucrose synthase. Positive Tajima's D values can indicate balancing selection, which tends to maintain several alleles at intermediate frequency (Wright and Gaut 2005). However, SSR data in the region around this locus (linkage group 7, 7:56 bin) do not show an increase of heterozygosity compared to other genomic regions (data not shown).

At each of the seven polymorphic loci, only two alleles (haplotypes) were amplified. Ho ranged from 0.023 to 0.477 (mean Ho = 0.263) and He from 0.022 to 0.487 (mean He = 0.307). These values are lower than those observed with SSR markers in commercial peach varieties, where Ho and He have been estimated to be 0.35 and 0.46, respectively (Aranzana et al. 2010), such that single SSRs are more informative than single SNPs for variability studies. This has been observed elsewhere. For example, Laval et al. (2002) calculated that k-1 times more biallelic markers are needed to achieve the same genetic distance accuracy as a set of microsatellites with k alleles. In peach, the average number of alleles per SSR ranges from about 3.5 to 7.3, depending on the cultivars and SSRs used (Wünsch et al. 2006; Dirlewanger et al. 2002; Testolin et al. 2000; Sosinski et al. 2000; Aranzana et al. 2003a, 2010). This means that, to get the same accuracy as with 100 SSR markers, we would need between 250 and 630 SNPs.
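The estimate of 250 to 630 SNPs follows directly from the k − 1 rule attributed above to Laval et al. (2002). A minimal Python sketch, with a hypothetical function name, makes the arithmetic explicit.

```python
def biallelic_markers_needed(n_ssrs, mean_alleles_per_ssr):
    """Biallelic markers matching n_ssrs SSRs under the k - 1 rule."""
    return round(n_ssrs * (mean_alleles_per_ssr - 1))


# Range of mean alleles per SSR reported for peach in the text: 3.5 to 7.3.
low = biallelic_markers_needed(100, 3.5)
high = biallelic_markers_needed(100, 7.3)
print(low, high)
```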

With population admixture, genotypic frequencies may deviate from those expected under panmixia. Testing for HWE, as a measure of population admixture, may help to predict false positives in association studies (Deng et al. 2001; Tiret and Cambien 1995). All SNPs found here were in HWE (considering p ≤ 0.05). This contrasts with the generalized departure from HWE previously detected with SSRs in peach (Aranzana et al. 2003a). HWE departures can be caused by intrinsic factors in the studied sample, such as population admixture and selection, but also by specific marker characteristics such as mutation rates (Deng et al. 2001). In the case of selection, HWE departures will affect not only the marker but also a relatively large genomic region around it. For a more realistic comparison of SNPs and SSRs, we selected a set of six SSRs from those analyzed by Aranzana et al. (2010), adjacent to the polymorphic fragments, and reanalyzed them in the same set of varieties. Three of the SSRs departed from HWE (p ≤ 0.05): BPPCT020, about 153 kbp from AG105; BPPCT038, 464 kbp and 469 kbp from CGPAA2668 and CGPPC2807, respectively; and UDP96-008, 2.2 Mbp from CGPPC4457. LD in peach has been estimated to extend 13–15 cM (Aranzana et al. 2010). Considering a rough correspondence of 430 kbp/cM, LD extends about 5.59–6.45 Mbp, so we can consider that the analyzed SSRs and SNPs are linked. Because linked SSRs and SNPs would be affected alike by admixture or selection, the departure from HWE observed only with the SSRs is more probably due to the different mutational properties of the two marker types.
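The linkage argument above rests on a unit conversion that is easy to verify. The Python sketch below converts the reported LD extent from cM to kbp using the rough 430 kbp/cM correspondence and checks that each SSR-to-fragment distance quoted in the text falls within it. The dictionary layout and names are illustrative only.

```python
KBP_PER_CM = 430                          # rough correspondence from the text
LD_EXTENT_CM = (13, 15)                   # LD extent reported for peach

ld_extent_kbp = tuple(cm * KBP_PER_CM for cm in LD_EXTENT_CM)

# Physical distances (kbp) between each SSR and the nearby fragment.
distances_kbp = {
    ("BPPCT020", "AG105"): 153,
    ("BPPCT038", "CGPAA2668"): 464,
    ("BPPCT038", "CGPPC2807"): 469,
    ("UDP96-008", "CGPPC4457"): 2200,
}

# A pair is treated as linked if it lies within the lower LD bound.
linked = {pair: d <= ld_extent_kbp[0] for pair, d in distances_kbp.items()}

print(ld_extent_kbp)
for (ssr, fragment), is_linked in linked.items():
    print(ssr, fragment, is_linked)
```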

The average heterozygosity of the cultivars was 0.28, ranging from 0 to 0.71. The most heterozygous were “Aline”, “Early Elberta”, “Redwing” and “Tendresse”, whereas “Admiral Dewey”, “Andross”, “Binaced”, “Calante”, and “Festina” were homozygous for all of the sequenced fragments (Table 1). “Admiral Dewey” and “Calante” were also homozygous at the six SSRs tested, as were “Escarolita” and “Rio Oso Gem”. In contrast, “Aline”, “Elberta”, and “Chinese Cling” were heterozygous at all six SSRs. The correlation of the heterozygosity found between SSR and SNP data was low (r = 0.504). To compare SSR and SNP variability data, a distance matrix was also constructed for both SNP and SSR data. On comparing both matrices through a Mantel test, no correlation between them was observed (r = −0.057). The reason could be that SNP-based distances are due almost entirely to drift, while SSR-based distances are also due in part to mutation (Hamblin et al. 2007). The low correspondence between both types of information is shown graphically in Fig. 3, where an SNP matrix alignment is plotted against the SSR distance tree.

Fig. 3 SNP alignment matrix and UPGMA tree constructed with six SSRs close to the sequenced polymorphic DNA fragments

Up to now, most peach variability has been assessed with SSRs. This information is now being used to select varieties to be included in sequencing projects, such as those oriented to SNP discovery. Our results suggest that the selected varieties may not fulfill the expectations concerning variability and heterozygosity that SSRs predict.

Variability at two candidate genes

Among the seven polymorphic fragments found, two were candidate genes for economically important characters, and for both the whole gene was sequenced in the 47 cultivars (Fig. 4). One (CGPAA2668, accession number CB822668) corresponded to a region of a pectate lyase, a gene involved in cell wall degradation and fruit softening which is consequently considered to have a role in ripening (Marin-Rodriguez et al. 2002). A quantitative trait locus (QTL) for flesh firmness has been detected in the region of linkage group 5 that contains this gene in an F2 population between the two peach cultivars Ferjalou Jalousia® and Fantasia (E. Dirlewanger, pers. comm.). The other candidate gene (CGPPB6189, accession number AJ876189), which was polymorphic in peach varieties with five SNPs, encodes a sucrose synthase, a central enzyme in the metabolic interplay of sucrose, hexoses, and starch synthesis. This gene co-localizes with three QTLs (glucose, fructose, and sucrose) mapped in linkage group 7 of the Prunus map in an advanced backcross between Prunus persica cultivars and the wild relative species Prunus davidiana (Quilot et al. 2004).

Fig. 4 Scheme of the PpPL1 and PpSUS genes. Grey boxes represent exons. Full triangles represent SNPs and empty triangles indels

The whole sequence of the pectate lyase gene (PpPL1) was obtained by sequencing four fragments (including CGPAA2668). The consensus sequence contained 2,239 bp: 1,000 bp corresponded to intronic and flanking regions and 1,239 bp to coding DNA distributed in four exons. In total, the gene contained two SNPs (i.e., one SNP every 1,119 bp), one in non-coding DNA and the other in coding DNA, producing a synonymous replacement. Both mutations were linked in a whole haplotype, so that two haplotypes and three genotypes (the two homozygous plus the heterozygous) were observed. The most common allele had a frequency of 60 %. Nucleotide diversity (π = 0.00043) was in the range of values observed in the seven polymorphic fragments, while θW was much lower (0.00017).

The whole sequence encoding the sucrose synthase (PpSUS) was obtained by sequencing seven fragments, producing a consensus sequence of 3,984 bp (3,941 bp excluding gaps): 1,569 bp corresponded to intronic and flanking regions and 2,415 bp to coding DNA distributed in 13 exons. In total, the gene contained 15 SNPs and four indels (one SNP every 263 bp and one indel every 985 bp). Seven of the SNPs occurred in six introns and eight in three exons. All exonic SNPs were synonymous changes except one, which produced the replacement of a lysine (the most frequent) by an asparagine. This replacement is likely to have a limited effect on the PpSUS enzyme activity due to the similar physicochemical properties of these two amino acids. All indels occurred in non-coding regions, three in introns and one in the 3′UTR. All SNPs were linked in a whole haplotype. Three of the indels were linked to the SNPs while the fourth, of 19 bp, was only observed in the Spanish landrace “Jesca”. No recombination was detected in the whole fragment, with three haplotypes and four genotypes observed. After removing indels, two haplotypes and three genotypes were observed. Nucleotide diversity (π) and θW values were within the range of values observed in the seven polymorphic fragments (π = 0.00146, θW = 0.00074) (Fig. 1). The most abundant SNP haplotype had a frequency of 74.5 %, with 28 of the varieties homozygous for the most frequent allele (haplotype) and 14 for the less frequent.
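The per-site estimates reported for the two genes can be approximately reproduced from numbers given in the text. The Python sketch below is only an illustration under stated assumptions: 94 sampled chromosomes (47 diploid cultivars), indels ignored, gap-free lengths of 2,239 and 3,941 bp, and all SNPs of a gene sharing the major haplotype frequency (60 % and 74.5 %) because they are completely linked. The function names are hypothetical and this is not the DnaSP implementation.

```python
def watterson_theta_per_site(segregating_sites, n_sequences, length_bp):
    """Watterson's estimator: S / a1 / L, with a1 = sum of 1/i, i = 1..n-1."""
    a1 = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / a1 / length_bp


def pi_per_site_linked(segregating_sites, major_freq, n_sequences, length_bp):
    """Nucleotide diversity when every SNP has the same allele frequency."""
    p, n = major_freq, n_sequences
    # Mean pairwise difference contributed by one biallelic site.
    per_site_diff = 2.0 * p * (1.0 - p) * n / (n - 1)
    return segregating_sites * per_site_diff / length_bp


N_CHROMOSOMES = 2 * 47                    # assumption: 47 diploid cultivars

theta_pl = watterson_theta_per_site(2, N_CHROMOSOMES, 2239)
pi_pl = pi_per_site_linked(2, 0.60, N_CHROMOSOMES, 2239)
theta_sus = watterson_theta_per_site(15, N_CHROMOSOMES, 3941)
pi_sus = pi_per_site_linked(15, 0.745, N_CHROMOSOMES, 3941)

print(f"PpPL1: theta_W = {theta_pl:.5f}, pi = {pi_pl:.5f}")
print(f"PpSUS: theta_W = {theta_sus:.5f}, pi = {pi_sus:.5f}")
```

Under these assumptions the sketch returns values close to those reported above, which suggests the reported estimates were computed per chromosome rather than per cultivar.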

For both genes, Tajima's D values, indicative of selection, were significantly higher than zero (2.301 and 2.696 for the pectate lyase and the sucrose synthase, respectively; p ≤ 0.05), indicating an excess of alleles at intermediate frequency, possibly due to balancing selection.
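Tajima's D contrasts the mean number of pairwise differences with the number of segregating sites scaled by sample size. The following Python sketch implements the standard formula (Tajima 1989) and applies it to the two genes under simplifying assumptions that are not stated in the source: 94 chromosomes, indels ignored, and all SNPs of a gene sharing the major haplotype frequency given in the text. Function names are hypothetical, and small differences from the reported values are to be expected.

```python
from math import sqrt


def tajimas_d(segregating_sites, mean_pairwise_diff, n):
    """Tajima's D from S, the mean pairwise differences k, and sample size n."""
    s = segregating_sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - s / a1) / sqrt(variance)


def pairwise_diff_linked(segregating_sites, major_freq, n):
    """Mean pairwise differences when all SNPs share one allele frequency."""
    p = major_freq
    return segregating_sites * 2.0 * p * (1.0 - p) * n / (n - 1)


N = 94                                    # assumption: 47 diploid cultivars
d_pl = tajimas_d(2, pairwise_diff_linked(2, 0.60, N), N)
d_sus = tajimas_d(15, pairwise_diff_linked(15, 0.745, N), N)
print(f"PpPL1 D = {d_pl:.2f}, PpSUS D = {d_sus:.2f}")
```

Both values come out positive and above 2, in line with the excess of intermediate-frequency alleles described above.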

Here, we observed relatively low SNP polymorphism in peach, consistent with the low variability previously described in the species. One of the two genes sequenced had an SNP density higher than that observed at the genome-wide level, and a possible pattern of selection was observed in both. This, together with their map position and putative gene function, makes them good candidates for affecting the phenotype. To provide additional evidence on the causal effects of these genes on the peach fruit phenotype, a larger sample of cultivars should be genotyped and phenotyped for different components of fruit firmness and sugar content to detect association between these two genes and the phenotype in which they could be involved.