Development of the SolSTW SNP array
The SolSTW SNP array combines SNPs from two discovery studies (Hamilton et al. 2011; Uitdewilligen et al. 2013). SNPs originating from Hamilton et al. (2011) were selected on the basis of good performance in an earlier experiment using the SolCAP SNP array (data not shown), without any additional selection criteria. In contrast, the large set of 129,156 SNPs originating from Uitdewilligen et al. (2013) required stringent selection criteria, since only a small subset could be selected. The high SNP density in potato allowed us to narrow down the number of potential SNP assays to 59,279 SNPs. Subsequently, redundancy among SNPs was reduced by clustering all SNPs according to SNP dosage, as described by Uitdewilligen et al. (2013). This resulted in 7019 clusters and 5334 single SNPs (singletons). For around 5200 clusters, two or more SNPs per cluster and gene were selected. For each of the remaining approximately 1800 clusters, one SNP was selected, and these were complemented with 2738 singletons, resulting in a total of 15,123 selected SNPs. SNPs originating from Uitdewilligen et al. (2013) will be referred to as PotVar SNPs. Table 1 shows the numbers of SNPs attempted.
Optimization of fitTetra with SolSTW array
Several runs of fitTetra were performed for genotype calling using the signal ratios obtained from the Infinium array. Over sequential runs, the programme settings were optimized and minor errors in the software were corrected. Two properties of the Infinium data initially resulted in erroneous clustering by the software. Firstly, the five clusters are not evenly distributed along the X-axis, as shown in Fig. 2a, b. In particular, the three heterozygous clusters are closer to each other and relatively far from the two homozygous clusters. Secondly, the signal of the homozygous clusters is biased and not exactly at 0 or 1, as shown in Fig. 2g, h. The software was modified to accommodate both properties, and these modifications have been included in the publicly available version of fitTetra since autumn 2013 (https://www.wageningenur.nl/en/show/Software-fitTetra.htm).
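The effect of these two properties on dosage calling can be illustrated with a minimal sketch. This is not the fitTetra implementation: it simply assigns each signal ratio to the nearest of five cluster centres, and the centre values, the sample ratios and the function name are hypothetical choices made only for illustration.

```python
# Illustrative only: this is NOT fitTetra code. All centre values are hypothetical.
EVEN_CENTRES = [0.0, 0.25, 0.5, 0.75, 1.0]
# Homozygous clusters offset from 0 and 1; heterozygous clusters closer together.
SHIFTED_CENTRES = [0.05, 0.35, 0.50, 0.65, 0.95]


def call_dosage(ratio, centres):
    """Return the dosage (0-4) whose cluster centre lies nearest to the signal ratio."""
    return min(range(len(centres)), key=lambda d: abs(ratio - centres[d]))


# Hypothetical signal ratios for six samples.
ratios = [0.06, 0.33, 0.41, 0.58, 0.66, 0.93]
even_calls = [call_dosage(r, EVEN_CENTRES) for r in ratios]
shifted_calls = [call_dosage(r, SHIFTED_CENTRES) for r in ratios]
print(even_calls)
print(shifted_calls)
```

With these toy values, the samples at 0.41 and 0.58 receive a different dosage depending on which set of centres is assumed, which is the kind of misclassification that arises when evenly spaced clusters are imposed on unevenly spaced data.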
Analysis of the SolSTW array with optimized fitTetra software
The improved version of fitTetra was used for genotype calling of the SolSTW array. Genotype calling was performed twice: once using all genotypes and once without the diploid genotypes. The genotype calls obtained without the diploid samples were used for further analysis, because inclusion of the diploid samples resulted in the additional rejection of 1184 markers owing to deviation from Hardy–Weinberg proportions as tested by fitTetra. The analysis of the tetraploid samples resulted in 15,271 fitted and 2716 rejected markers. Subsequently, a bi-parental tetraploid mapping population was used to identify SNPs for which parental SNP dosage and offspring ratios were in disagreement. Such disagreement is a putative indicator of poor SNP performance, and visual inspection of GenomeStudio output as shown in Fig. 2 resulted in the rejection of another 378 SNPs. In addition, 1832 markers with a call rate below 95 % in fitTetra were visually inspected using GenomeStudio. The remaining 13,061 SNPs with good Mendelian fit and call rate >95 % were assumed to be good calls, and visual inspection was omitted. For the visual inspection, fitTetra output was used as shown in Fig. 2b, d, f, h. In these figures, diploid samples are represented by grey bars. The position of the diploids on the X-axis allows potentially poor markers to be identified, namely when diploid samples fall in simplex or triplex clusters. As shown in Fig. 2d, f, the diploid samples do not cluster together in the nulliplex, duplex or quadruplex clusters, and markers like these were therefore removed. This incorrect clustering of diploids was predominantly observed in markers with more than five clusters, as shown in Fig. 2e, f, or in markers with “clouds” of data points, as shown in Fig. 2c, d. For 1206 of the 1832 markers with >5 % missing calls, visual inspection resulted in removal of the marker from the final dataset. For the other 626 markers, fitTetra produced false negative genotype calls although the marker signal intensities were correct. Such markers were manually re-scored using GenomeStudio.
The 2716 rejected markers were visually inspected with fitTetra output as shown in Fig. 2, and scored manually if the marker had been mistakenly rejected. This resulted in the recovery of 843 markers. Of these 843 markers, 689 had an allele frequency below 1 %; these were therefore correctly rejected on the basis of the peak.threshold setting of 0.99 in fitTetra. The remaining 154 were mistakenly rejected for unknown reasons.
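The marker counts reported for the tetraploid-only run can be reconciled with the size of the final dataset by simple bookkeeping. The sketch below uses only the counts given in the text; the variable names are our own.

```python
# Marker counts as reported in the text (tetraploid-only fitTetra run).
fitted = 15_271
removed_mendelian = 378        # parental dosage and offspring ratios in disagreement
low_call_rate = 1_832          # call rate below 95 %, visually inspected
removed_low_call_rate = 1_206  # removed after visual inspection
recovered_from_rejected = 843  # recovered from the 2716 rejected markers

# Low-call-rate markers that were kept and manually re-scored.
rescored_low_call_rate = low_call_rate - removed_low_call_rate

# Final dataset: fitted markers minus removals plus recovered markers.
final = fitted - removed_mendelian - removed_low_call_rate + recovered_from_rejected
print(rescored_low_call_rate, final)
```

The result agrees with the 626 manually re-scored markers and with the final dataset of 14,530 SNP markers.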
Reproducibility of genotype calls
As shown in Tables 1 and 2, data collection with fitTetra and GenomeStudio resulted in a final dataset of 14,530 SNP markers. The genotype calls of the 39 replicated tetraploid samples showed a high concordance between replicates. On average, only 3.3 calls (0.02 %) differed between the replicated samples, of which 60 % were differences within the heterozygous clusters. In addition, for on average 74 markers (0.5 %) a call was missing for either of the replicated genotypes. The 26 replicates of the internal diploid control also showed highly concordant results. We observed seven markers with a deviating observation. In addition, we observed 66 markers with one or more missing calls, of which 50 % were caused by two of the 26 replicates.
The percentage of missing calls was very low for the final dataset of 14,530 markers and 537 genotypes, with an average of only 95 missing calls per genotype and 3.5 missing calls per marker (0.65 %). For genotypes that have wild species in their pedigree and were not used in the SNP discovery panel of Uitdewilligen et al. (2013), the average number of missing values was much higher (184).
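The replicate comparison described above amounts to counting, per pair of replicated samples, the markers with differing calls, the markers with a missing call in either replicate, and the differing calls that fall within the heterozygous classes. A minimal sketch follows, in which the function name, the use of None for a missing call and the toy dosage vectors are all assumptions made for illustration.

```python
HETEROZYGOUS = {1, 2, 3}  # simplex, duplex and triplex dosage classes


def compare_replicates(calls_a, calls_b):
    """Count differing calls, markers missing in either replicate, and
    differences that occur within the heterozygous classes.

    Calls are dosages 0-4; None denotes a missing call.
    """
    differing = missing = het_differences = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            missing += 1
        elif a != b:
            differing += 1
            if a in HETEROZYGOUS and b in HETEROZYGOUS:
                het_differences += 1
    return differing, missing, het_differences


# Hypothetical calls for seven markers in two replicates of one genotype.
rep1 = [0, 1, 2, 4, None, 3, 2]
rep2 = [0, 2, 2, 4, 1, 3, 0]
result = compare_replicates(rep1, rep2)
print(result)
```

In this toy example two calls differ, one of them within the heterozygous classes, and one marker has a missing call.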
Analysis of factors influencing assay failure
Several factors that could cause assay failure were examined. Table 1 shows the percentages of assay failure according to the origin of the SNP assay. The SolCAP SNPs originating from the 8303 array were clearly the most successful (94.0 %), because these SNPs had previously been tested on the Infinium platform. The non-pre-tested SNPs from Hamilton et al. (2011) and the SNPs originating from the SNP discovery study of Uitdewilligen et al. (2013) show a lower percentage of successful assays (82.5 and 77.5 %, respectively). However, when considering markers in coding regions only, the assay failure rate of PotVar SNPs is much lower (11.4 %, Table 3). The majority of the manually developed SNPs failed (70 %), which could be explained by their location in R-genes, which are members of a large, highly variable gene family. Table 3 shows the percentages of assay failure for 12,272 SNPs according to their localization in coding or non-coding regions, as well as according to their chromosomal position on the pseudomolecules (Sharma et al. 2013). Chromosomal positions can be divided into euchromatin, pericentromeric heterochromatin and the border between the two. SNPs localized in the pericentromeric heterochromatin are clearly more likely to fail. More striking, however, is the low percentage of assay failure in coding regions compared with non-coding regions.
The high nucleotide diversity of potato implies that SNP assays may frequently be affected by flanking SNPs. We therefore aimed to target SNPs without flanking SNPs for assays; this is, however, problematic in potato because of its high SNP density. Consequently, known flanking SNPs are present for many (34.8 %) of the SNP assays on this array that originate from Uitdewilligen et al. (2013). Figure 3a shows the percentage of assay failure of these PotVar SNPs as a function of the distance to the flanking SNPs. This graph shows a trend in which flanking SNP distance is correlated with assay failure. Figure 3b shows a correlation between assay failure and the number of flanking SNPs: assay failure increases with the number of flanking SNPs. In addition, the GC content was compared between successful and failed SNPs; however, there was no significant relation between assay failure and GC content.
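A relation such as the one plotted in Fig. 3a can be obtained by binning assays according to the distance to the nearest flanking SNP and computing the fraction of failed assays per bin. The sketch below shows one way to do this; the bin width, the function name and the toy records are assumptions and do not reproduce the data of this study.

```python
from collections import defaultdict


def failure_rate_by_bin(assays, bin_width=10):
    """Return {bin_start: fraction_failed}.

    assays: iterable of (distance_to_nearest_flanking_snp_bp, failed) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for distance, failed in assays:
        bin_start = (distance // bin_width) * bin_width
        totals[bin_start] += 1
        failures[bin_start] += int(failed)
    return {b: failures[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy records, for illustration only.
assays = [(3, True), (7, True), (8, False), (12, True),
          (15, False), (27, False), (29, False)]
rates = failure_rate_by_bin(assays)
print(rates)
```

The same function could be applied to the number of flanking SNPs instead of the distance (with a bin width of 1) to obtain a summary comparable to Fig. 3b.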
The allele frequency distribution of SNPs across the 537 genotypes is shown in Fig. 4. PotVar SNPs (wide bars, left Y-axis) and SolCAP SNPs (narrow bars, right Y-axis) differ greatly in allele frequency. PotVar SNPs are split into pre-1945 (dark blue) and post-1945 (green) SNPs. The average allele frequency is 11 % for PotVar SNPs and 22.7 % for SolCAP SNPs. This large difference in allele frequencies, also shown in Table 4, is not surprising, since we deliberately did not exclude SNPs with a low allele frequency, whereas such SNPs were clearly selected against in the design of the SolCAP array.
Identification of pre-1945 and post-1945 variation
The comprehensive sampling of the gene pool of cultivated potato allowed the evaluation of changes in the composition of the gene pool over time. This resulted in the identification of SNP markers that are the result of introgression breeding and SNP markers that represent the initial genetic diversity within the founders of the contemporary gene pool. A SNP that is polymorphic in at least one of the 48 cultivars released before 1945 is hereafter referred to as a “pre-1945” SNP. This genetic variation most likely represents the material that was brought to Europe from the Americas between the 16th and the 19th century. A SNP marker that is monomorphic in all 48 old cultivars but polymorphic in more recent cultivars/progenitors is hereafter referred to as a “post-1945” SNP. Table 4 shows the large difference in allele frequency between the post-1945 SNPs (average MAF = 1.4 %) and the pre-1945 SNPs (average MAF = 18.0 %). Table 2 shows the numbers and percentages of post-1945 SNPs per chromosome. In total, 3500 SNPs (3281 PotVar + 219 SolCAP) are post-1945, which corresponds to 24.1 % of the SNP markers on this array. The SNP discovery study of Uitdewilligen et al. (2013) made a large contribution to this group of post-1945 SNPs (Table 2). The 219 post-1945 SNPs contributed by SolCAP were mostly introduced by cultivar Lenape (114 SNPs), of which two descendants (Atlantic and Snowden) were included in the discovery study of Hamilton et al. (2011). The chromosomal positions of post-1945 SNPs were analysed. Post-1945 SNPs appear to cluster together on chromosomes and in genotypes. Figure 5 shows a genome-wide plot of the location of introgression segments first observed in six genotypes. Introgression segments differ greatly in size, ranging from very small (Y-66-13-636) to complete chromosomes (VTN 62-33-3). A clear example is provided by the 97 SNPs first observed in Craigs Bounty (1946): Fig. 5 shows 95 SNPs in three introgression segments on chromosomes 5 (green), 10 (dark blue) and 12 (grey). Ten genotypes (VTN 62-33-3, Lenape, Mara, Urgenta, VE 71-105, AM 78-3704, Maris Piper, Craigs Bounty, Ulster Glade, VE 66-295) are responsible for the introduction of 50 % of post-1945 SNPs. A full table with the numbers of SNPs introduced per cultivar is given in Supplementary file 3.
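The distinction between pre-1945 and post-1945 SNPs can be expressed as a simple classification rule on the dosage calls. The sketch below is our own illustration and makes one explicit assumption: a group of cultivars is treated as monomorphic when all called dosages are nulliplex (0) or all are quadruplex (4). Function names and toy dosage vectors are hypothetical.

```python
def is_monomorphic(dosages):
    """True if all called dosages are 0 or all are 4 (assumed definition).

    Dosages are 0-4; None denotes a missing call and is ignored.
    """
    called = [d for d in dosages if d is not None]
    return all(d == 0 for d in called) or all(d == 4 for d in called)


def classify_snp(old_dosages, recent_dosages):
    """Classify a SNP from its dosages in pre-1945 and more recent cultivars."""
    if not is_monomorphic(old_dosages):
        return "pre-1945"
    if not is_monomorphic(list(old_dosages) + list(recent_dosages)):
        return "post-1945"
    return "monomorphic"


# Hypothetical dosage vectors, for illustration only.
print(classify_snp([0, 1, 0], [0, 2, 1]))   # variant present in old cultivars
print(classify_snp([0, 0, 0], [0, 1, 0]))   # variant first seen in recent cultivars
print(classify_snp([4, 4, None], [4, 4]))   # no variation in the panel
```

Under this rule, the first cultivar in order of release that deviates from the monomorphic state of the old cultivars would be the genotype in which a post-1945 SNP is first observed.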
Processes that shape the genetic composition of the contemporary gene pool of potato
Several processes shape the contemporary gene pool of potato. New genetic variants are introduced by introgression breeding, and introgressions cause the loss of existing variants by substitution. Selection, including breeding for specific market niches (e.g. starch cultivars), also influences allele frequencies. In specific market niches, the limited gene pool is easily affected by random genetic drift (genetic erosion). These processes (introgression/substitution, selection, drift) were studied by comparing SNP allele frequencies between two groups. Firstly, the pre-1945 cultivars were compared with the cultivars released after 2005. Secondly, the pre-1945 cultivars were compared with cultivars from the “starch” subpopulation. For post-1945 SNPs, significant increases in allele frequency can be observed. In this study we analysed 246 cultivars that were released between 1946 and 2005. In this group, 108 cultivars contributed post-1945 SNPs, ranging from 1 to 447 post-1945 SNPs per cultivar (Supplementary file 2). Of these 108 cultivars, 39 are shown in Fig. 6, arranged in order of market introduction. These 39 cultivars are the donors of those post-1945 SNPs that have attained the largest increase in allele frequency within the 242 cultivars released after 2005. The negative slope observed in Fig. 6 indicates that introgression segments introduced soon after 1945 could reach a higher allele frequency (up to 19 %) than more recently introgressed haplotypes (up to 4 %). This suggests that prolonged presence of a beneficial introgressed haplotype in the gene pool results in increasingly higher allele frequencies due to positive selection. Note that a 4 % increase in allele frequency implies that almost 20 % of the cultivars carry this haplotype in simplex condition, whereas a 19 % increase implies that more than half of the cultivars are simplex or duplex and occasionally triplex.
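The relation between allele frequency and the proportion of cultivars carrying a haplotype follows from tetraploidy: each cultivar contributes four allele copies, so if every carrier is simplex the carrier fraction is four times the allele frequency. The sketch below illustrates this under that simplifying assumption; the function names and the toy dosage vector are hypothetical.

```python
PLOIDY = 4


def allele_frequency(dosages, ploidy=PLOIDY):
    """Allele frequency from dosage calls (0-4); None (missing) is ignored."""
    called = [d for d in dosages if d is not None]
    return sum(called) / (ploidy * len(called))


def max_carrier_fraction(freq, ploidy=PLOIDY):
    """Fraction of cultivars carrying the allele if every carrier is simplex."""
    return min(1.0, ploidy * freq)


# Hypothetical panel of 25 cultivars: two simplex, one duplex, 22 nulliplex.
dosages = [1, 1, 2] + [0] * 22
freq = allele_frequency(dosages)
carriers = sum(1 for d in dosages if d > 0) / len(dosages)
print(freq, carriers, max_carrier_fraction(freq))
```

Under the all-simplex assumption, an allele frequency of 4 % corresponds to about 16 % carriers, of the same order as the figure quoted above, and 19 % corresponds to at most 76 %. The presence of duplex or triplex carriers lowers the carrier fraction for a given allele frequency, as in the toy panel.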
In contrast, 50 % of all post-1945 SNPs remain below an allele frequency of 1 %, and 549 SNPs were no longer polymorphic (nulliplex) in cultivars released after 2005. These 549 SNPs can be considered lost, i.e. phased out soon after introduction. Of the pre-1945 SNPs, 538 (4.9 %) were no longer polymorphic in contemporary cultivars. These SNPs are also assumed to have been lost during breeding. This may be due to selection, but random genetic drift is also plausible, because the initial allele frequency of these SNPs in old germplasm was already very low (1.4 % on average).
A comprehensive overview of the changes in allele frequency of all pre-1945 SNPs (in post-2005 and starch cultivars compared with old cultivars) is shown in Fig. 7. The largest column in the middle of the figure shows that the majority of the SNPs (6441 or 58 %) hardly changed in allele frequency during a century of potato breeding. Starch cultivars show somewhat larger fluctuations in allele frequencies, because of an emphasis on introgression breeding for nematode resistance along with founder effects (discussed below). Figure 7 also suggests that more SNPs have declined in allele frequency than have increased. This suggests that the broadening of genetic diversity by introgression since 1945 has resulted in an overall net decrease in the frequency of pre-1945 haplotypes. In addition to these allele substitutions, founder effects may reinforce this fluctuation. In Fig. 8 the change in allele frequency in the “starch” subpopulation is plotted against the allele dosage of an important progenitor (VTN 62-33-3). The figure clearly shows that a higher dosage of a SNP in an important founder contributes to the gain in allele frequency over time. The correlation between SNP dosage in a specific founder and allele frequency gain within the “starch” subpopulation was strongest for VTN 62-33-3 and AM 78-3704, two frequently used progenitors.
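A comparison of the kind shown in Fig. 8 can be summarized by grouping SNPs according to their dosage in a founder and averaging the change in allele frequency per group. The sketch below is a minimal illustration; the function name and the toy records are assumptions and do not represent the data of this study.

```python
from collections import defaultdict


def mean_change_by_founder_dosage(records):
    """Return {founder_dosage: mean allele frequency change}.

    records: iterable of (founder_dosage, freq_old, freq_new) per SNP, where
    freq_old and freq_new are allele frequencies in the old cultivars and in
    the subpopulation of interest.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dosage, freq_old, freq_new in records:
        sums[dosage] += freq_new - freq_old
        counts[dosage] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


# Hypothetical toy records, for illustration only.
records = [(0, 0.20, 0.15), (0, 0.10, 0.05), (1, 0.10, 0.15),
           (2, 0.25, 0.40), (2, 0.25, 0.30)]
changes = mean_change_by_founder_dosage(records)
print(changes)
```

In this toy example the mean change rises with founder dosage, which is the pattern a founder effect would produce.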
The processes underlying allele frequency changes over time (introgression, substitution, selection, drift and founder effects, i.e. frequent use of parents) are highly confounded. Still, we assume that SNP variants that show the greatest increase in frequency are linked to important alleles for agronomic performance, and vice versa. The most striking observations are that (1) 95.1 % of the pre-1945 SNPs are still polymorphic after 50–150 years of breeding and (2) we do not observe any fixation of pre-1945 SNPs in cultivars released after 2005.