Evaluation of SNP chip and KASP assay genotyping accuracy by Sanger sequencing
Two haplotype blocks associated with root dry biomass on chromosome 5B of hexaploid wheat were described by Voss-Fels et al. (2017), based on GWAS using the 90 k SNP Illumina Infinium array. Haploblock Hap-5B-RDMa was defined based on SNP calls from 9 SNP chip probes, while haploblock Hap-5B-RDMb was defined based on SNP calls from 6 SNP probes. In a set of 215 diverse winter wheat genotypes, 7 different haplotypes were identified for haploblock Hap-5B-RDMa and 9 different haplotypes for haploblock Hap-5B-RDMb (Table 1). Most of the lower-frequency haplotypes of both haploblocks contained SNP alleles that were called by Voss-Fels et al. (2017) as ‘Null’ alleles for at least one probe (10 of 11 haplotypes below 5% frequency, Table 1).
Table 1 Comparison of haplotype variants for haploblocks Hap-5B-RDMa and Hap-5B-RDMb detected by SNP chip (left), KASP marker analysis and Sanger sequencing (right) in a panel of 215 wheat genotypes
The low frequency of some haplotypes harboring ‘Null’ alleles suggested that the failed calls in SNP chip genotyping are due to either hybridization or SNP calling artifacts, and not to genuine ‘Null’ alleles resulting from presence/absence variation for the 50 bp region complementary to the SNP probe. To test this hypothesis in a case study, KASP assays were designed for two affected SNP probes of Hap-5B-RDMa and for all 3 affected SNP probes of Hap-5B-RDMb (Table 1). These assays were applied to 19 genotypes including different haplotype combinations for Hap-5B-RDMa and Hap-5B-RDMb. In contrast to the SNP chip genotyping results, none of the 5 tested KASP assays failed to call one of the two major alleles, nor did any of them ever call a ‘Null’ allele (Table 1 on the right). However, this might also be due to amplification of the allele-specific primer from a matching complementary nucleotide following the ‘Null’ allele position. To confirm the correct allele composition of these genotypes, the SNP flanking sequences were analyzed by Sanger sequencing for all 23 genotypes harboring genuine ‘Null’ alleles that covered the full sequence complementary to 8 probes. Sanger sequencing revealed that all ‘Null’ alleles were incorrectly called by SNP array genotyping and that the KASP assays always called the correct alleles (Supplementary Table S3).
Applying the corrected genotype data from the KASP marker analysis, the number of haplotype variants was reduced from seven to two observed haplotypes within haploblock Hap-5B-RDMa (Hap-5B-RDMa-h1 and -h2) and from nine to four haplotypes within haploblock Hap-5B-RDMb (Hap-5B-RDMb-h1, -h2, -h3 and -h8) (Table 1). Only one of 11 putative low-frequency haplotypes (< 5%) was confirmed by KASP genotyping and Sanger sequencing (h8 for haploblock Hap-5B-RDMb, frequency 0.5%). Recalculation of the haploblock-trait association for root dry biomass using the corrected KASP marker genotyping data increased the explained variance (adjusted R2) from 5.6% to 12.5% for haploblock Hap-5B-RDMa and from 2.6% to 9.5% for haploblock Hap-5B-RDMb (Fig. 1). In addition, linkage disequilibrium analysis based on the correlation of newly developed KASP marker pairs for the panel of Voss-Fels et al. (2017) gave no indication that the organization of the trait-associated markers into two haploblocks on chromosome 5B is incorrect.
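For readers unfamiliar with the adjusted R2 statistic reported above, the following minimal sketch shows its standard textbook definition. The association model itself is not specified in this section, so the function name, the sample size and the number of predictor terms used below are purely illustrative assumptions and are not values taken from the study.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Standard adjusted R2: penalizes R2 for the number of predictor terms.

    r2            plain coefficient of determination (0..1)
    n_samples     number of observations (e.g., genotypes)
    n_predictors  number of predictor terms in the model (an assumption here;
                  it depends on how haplotypes are encoded)
    """
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical illustration only: the same raw R2 is penalized more strongly
# when more predictor terms are needed to describe the haplotypes.
print(adjusted_r2(0.15, n_samples=200, n_predictors=1))
print(adjusted_r2(0.15, n_samples=200, n_predictors=6))
```

The sketch only illustrates how the statistic behaves; it does not reproduce the values shown in Fig. 1.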
Identification of trait-associated nucleotide alleles and calling errors from SNP chip hybridization data
Conversion of SNP chip markers used in GWAS and biparental QTL mapping into breeder-friendly marker systems requires determination of the nucleotide allele identity associated with the increase/decrease in the trait of interest, sourced either from published studies or from SNP chip databases. However, allele identification is not always simple due to the use of different formats applied for reporting the SNP and allele identities. For example, in the CropSNP database (http://snpdb.appliedbioinformatics.com.au), genotype data for Illumina wheat Infinium array SNP calls are reported either as AA/AB/BB allele patterns (representing cluster positions) or as predicted nucleotide allele patterns (e.g., AA, AT, TT) (Scheben et al. 2019). Some commercial service providers also report the allele patterns in IUPAC one letter code (e.g., A, W, T). In addition, if using a commercial service provider, the raw data might not be provided and processed SNP calls are reported to the customer based on a cluster file developed or optimized by the company (e.g., providing SNP calls cleaned from hemi-SNP calling patterns, including processing artifacts). Also, most publications on GWAS or biparental QTL mapping do not contain enough additional information to directly infer the SNP probe composition and SNP nucleotide allele identity associated with an increase/decrease in a target trait to derive other marker types. One reason is that strand-specific identity for a SNP allele cannot unambiguously be reported for hybridization probes, as correct orientation of contigs within genome assemblies might be adjusted over time and is dependent on specific reference genome assemblies. To ensure consistency in reporting SNP allele calls from SNP chip assays, Illumina uses their own TOP/BOT strand nomenclature and method (Illumina technical note 2006; Nelson et al. 2012; Zhao et al. 2018). Furthermore, the 50 bp SNP probe sequences for Illumina Infinium arrays are rarely directly published. 
Instead, a minimum of 101 bp sequences flanking the SNP polymorphism (with 50 bp on the left and 50 bp on the right of the SNP, e.g., Wang et al. 2014, Supplementary Table S5) is usually published, and this sequence is used by Illumina to design the final 50 bp probe sequences. To identify the exact 50 bp probe sequences used on the SNP chip (e.g., for BLAST analyses to a reference genome), the TOP/BOT designations for the submitted customer strand and for the Illumina design strand are provided in the manifest file from Illumina. This information can be accessed in the GenomeStudio software by importing the raw hybridization data (idat color files), sample and chip information (sample sheet, manifest file; ‘Customer Strand’ and ‘ILMN Strand’ columns in the SNP table). We applied these rules for identification of called alleles for the 15 root biomass-associated SNPs detected by the SNP chip array and validated the SNP chip calls by Sanger sequencing. Examples are given in Supplementary Table S2 for identification of SNP probe sequence composition. In diploid organisms, biallelic loci are expected to exhibit three cluster positions for a simple SNP (AA, AB, and BB; Fig. 2a). Based on the customer-submitted biallelic SNP identities, these clusters can be assigned to homozygous and heterozygous nucleotide calls following the Illumina TOP/BOT nomenclature by using information from the manifest file (see rules described in The Triticeae Toolbox T3 for details, https://triticeaetoolbox.org/wheat/). A summary of the data and the derived SNP alleles associated with high root dry biomass on the respective DNA strands are presented in Table 2. However, especially for polyploids like hexaploid bread wheat, observed clusters and predicted nucleotide allele designations do not always represent the homozygote and/or heterozygote states and correct nucleotide allele identities. 
Instead, they only represent the position of the clusters against the x and y axes, defined by the GenomeStudio software applying the manifest file provided by Illumina (plus any customized adaptations by the customer or a genotyping service provider). SNP array variant calling for the 15 SNP probes associated with root biomass in wheat in Table 1 was performed by the commercial service provider TraitGenetics, applying a custom cluster file adapted using a large panel of world-wide wheat accessions. Non-polymorphic hemi-alleles were removed by TraitGenetics from the SNP call information, and chromosome-specific nucleotide alleles for these example SNPs on chromosome 5B were predicted from the clustering pattern of each SNP (Supplementary Figure S1, Fig. 2b–d). For 99% of the data points, these SNP calls were identical to the SNP calls from the raw Illumina data files and the standard manifest file in GenomeStudio (after removing the 5A and 5D chromosome signal calls from the hemi-SNPs and translation into the IUPAC single-letter code). However, these SNP nucleotide call predictions were not always correct for one or both data sets. For 1 of the 15 SNP assays we investigated, Sanger sequencing revealed that a different nucleotide allele combination should have been called and that a different nucleotide than reported is associated with an increase in wheat root biomass. This is shown for probe Excalibur_c60554_394 in Fig. 2d. Here, a T/G polymorphism was called, and based on these data a ‘T’ allele was predicted to be associated with high root biomass for haplotype 5B-RDMa-h2 (based on the predicted SNP variation submitted by the customer to Illumina). In contrast, Sanger sequencing of the 5B homeolog revealed an A/G polymorphism, and based on these data an ‘A’ allele was found to be associated with high root biomass (Table 2). 
Sanger sequencing of the 5A and 5D homeologs and comparison with the wheat reference genome revealed that the SNP probe Excalibur_c60554_394 matches all three homeologous copies in the reference genome, but the 5A and 5D homeologs were found by Sanger sequencing to be monomorphic among all tested homozygous wheat accessions. In contrast, the 5B homeologue was found to be polymorphic between accessions, segregating with an A/G polymorphism. Thus, the polymorphism is 5B homeologue-specific and is detected by a typical hemi-SNP cluster pattern. However, as ‘A’ and ‘T’ are both measured by the same green color signal (Cy3), while ‘G’ and ‘C’ are both measured by the same red color signal (Cy5) in the dual-color Illumina detection system, the nucleotide segregation on chromosome 5B for a hemi-SNP is measured by a dosage-dependent clustering with two heterozygous clusters. Hence, when relying solely on the submitted customer SNP information embedded in the manifest file, this SNP is incorrectly predicted as a ‘T/G’ polymorphism instead of the correct ‘A/G’ polymorphism, due to the underlying ‘triallelic’ hemi-SNP (Fig. 2d). One cluster in Fig. 2d represents the ‘AATTGG’ (5B, 5A, 5D) allele composition (4 green, 2 red signal equivalents) for haplotype h2 (associated with high root biomass), with the ‘AA’ alleles coming from homeologue 5B. The other cluster in Fig. 2d represents the ‘GGTTGG’ allele composition (2 green, 4 red signal equivalents) for the low root biomass haplotypes, with two of the four ‘G’ alleles derived from homeologue 5B. Thus, predicting the allele composition of clusters observed in SNP chip data from the homeologue-specificity of the SNP, as evaluated by genetic mapping and a reference genome, might yield incorrect nucleotide alleles for hemi-SNPs due to the dual-color nature of the detection system. 
Note that the cluster positions for some biallelic hemi-SNPs with correct SNP call predictions cannot be distinguished from triallelic hemi-SNPs with false SNP call predictions (Fig. 2c, d). Accordingly, when using the predicted T/G polymorphism for probe Excalibur_c60554_394 to design allele-specific KASP primers, we failed four times to achieve the expected clusters until we redesigned primers based on SNP flanking sequences of all 3 homeologs in 15 different genotypes produced by Sanger sequencing.
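The dual-color argument above can be made concrete with a small sketch. It counts green (A/T) and red (G/C) signal equivalents for a six-letter allele composition written in the order 5B, 5A, 5D, as in the text. The function and variable names are illustrative assumptions; the sketch ignores real-world signal intensity variation and is not part of the GenomeStudio software.

```python
# Dye channel per nucleotide, as described in the text:
# A and T -> green (Cy3), G and C -> red (Cy5).
GREEN_BASES = {"A", "T"}
RED_BASES = {"G", "C"}


def signal_equivalents(composition: str) -> tuple[int, int]:
    """Return (green, red) signal equivalents for a string such as 'AATTGG'."""
    bases = composition.upper()
    unknown = set(bases) - GREEN_BASES - RED_BASES
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    green = sum(base in GREEN_BASES for base in bases)
    red = sum(base in RED_BASES for base in bases)
    return green, red


# Compositions are written as 5B + 5A + 5D, two alleles per homeologue.
examples = [
    ("observed, high root biomass (A on 5B)", "AATTGG"),
    ("observed, low root biomass (G on 5B)", "GGTTGG"),
    ("predicted from manifest (T on 5B)", "TTTTGG"),
]
for label, composition in examples:
    print(label, composition, signal_equivalents(composition))
```

Because ‘AATTGG’ and the hypothetical ‘TTTTGG’ give the same green and red counts, the cluster position alone cannot tell an ‘A’ allele on 5B from a ‘T’ allele, which is consistent with the incorrect T/G prediction for probe Excalibur_c60554_394.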
Table 2 SNP calls produced from raw SNP chip data using the software GenomeStudio and by a commercial SNP chip genotyping service provider (SP) and comparison with Sanger sequencing of regions flanking SNPs in up to six homeologous and paralogous copies of probe targets (5B1, 5B2, 5A1, 5A2, 5D1, 5D2); §Nucleotide alleles for the haplotype combination Hap-5B-RDMa-h2 and Hap-5B-RDMb-h3 associated with high root biomass are shown in bold letters
Conversion rate into KASP assays using SNP probe flanking regions for primer design
Three successive approaches of increasing data analysis complexity were applied to convert SNPs detected by SNP arrays into valid predictive KASP assays (Table 3). In the first, simplest approach, the SNP flanking sequences (101 bp) submitted to Illumina for probe design (Supplementary Table S2) were used to design primers for KASP assays without any further consideration of the wheat genome composition. This is the approach applied by the commercial KASP assay design service (‘KASP-by-Design’) of LGC Biosearch Technologies. However, the standard genotyping service provided by LGC does not accept common and allele-specific primer sequences as input in their submission form. Instead, a minimum of 101 bp of sequence flanking the SNP is required as input (e.g., the sequence used to design the SNP probe by Illumina). When providing LGC with the 101 bp flanking regions of the 15 SNPs associated with root biomass, without masking any region for primer design (Supplementary Table S2), 11 out of 15 assays produced clearly separated KASP clusters, but only 3 of the 15 assays, all from haploblock Hap-5B-RDMb, produced KASP clusters showing genotype SNP calling patterns consistent with the SNP chip data, using a reference genotype set of 213 lines (Table 3). This suggests a lack of locus-specificity of the designed KASP assays.
Table 3 Conversion rates for SNPs spanning both haploblocks Hap-5B-RDMa and Hap-5B-RDMb from SNP chip arrays into validated KASP marker assays
One reason for this low conversion rate is that the flanking sequences used to design probes for the 90 K SNP chip are derived from RNA-Seq data and not from genomic data. Ignoring this might result in the design of primer sequences that span exon–intron boundaries within the wheat genome, leading to failure of the KASP assays (no signals). An example is shown in Supplementary File S2 (example 2). BLASTn analysis of the 101 bp SNP flanking sequences against the wheat reference genome IWGSC RefSeq v1.0 cv. Chinese Spring confirmed that 5 out of 9 SNP-containing sequences in haploblock Hap-5B-RDMa and 0 out of 6 SNP-containing sequences in Hap-5B-RDMb were interrupted by introns (Supplementary Table S4, example in Fig. 3a). This explained the assay conversion failure for 3 of these 5 SNP probes in haploblock Hap-5B-RDMa; for the remaining 2, the primer binding sites happened by chance to lie outside of intron sequences. Furthermore, of the 12 SNP probes whose KASP primers were not interrupted by introns, only 3 were successfully converted into KASP assays. Comparison of these sequence alignment data and SNP chip calling data from GenomeStudio (Supplementary Figure S1) with the obtained KASP cluster patterns revealed that only the 3 SNP probes showing high similarity to only one homeologue on chromosome 5B (Supplementary Table S4) and a simple SNP segregation pattern in SNP chip data produced the expected, clearly separated clusters in KASP assays with 213 genotypes. These 3 successful KASP assays were derived from the 101 bp flanking sequences for probes BS00022231_51, BS00110293_51, and Tdurum_contig48959_1172 (haploblock Hap-5B-RDMb).
Conversion rate into KASP assays using SNP probe flanking regions and comparative alignment to wheat reference genomes/assemblies for primer design
In a second approach (Table 3), the KASP assays were redesigned based on the obtained comparative alignment with the IWGSC RefSeq Chinese Spring v1.0 reference genome, avoiding intron spanning primer binding sites and locus-unspecific common primer binding sites by visual primer placement. This resulted in an increase in successful KASP assays so that the 3 remaining SNPs located within the haploblock Hap-5B-RDMb could be converted into validated KASP assays.
In contrast, only 2 out of 9 SNPs (Kukri_c46570_214, RAC875_c18088_2222) in haploblock Hap-5B-RDMa could be successfully converted into KASP assays producing the expected clusters when applying the second approach (Supplementary Table S4). From the 7 remaining KASP assays designed for haploblock Hap-5B-RDMa using this approach, 5 produced clearly separated clusters. However, although they were able as expected to call all 213 homozygous reference samples, they failed to call heterozygous samples correctly. To explore why the seven KASP assays designed using the 2nd approach for haploblock Hap-5B-RDMa failed to call genotypes correctly, 5B subgenome-specific primer pairs for Sanger sequencing of the SNP flanking regions (300–650 bp) were designed using the Chinese Spring reference genome and software package Primer3. These primers were used for PCR amplification from 40 lines harboring haplotypes Hap-5B-RDMa-h1 and -h2 (Table 4). For Hap-5B-RDMb, PCR amplification products of the expected sizes were always obtained for genotypes exhibiting haplotypes Hap-5B-RDMb-h1, -h2, -h3 and -h8. In contrast, for haploblock Hap-5B-RDMa the expected PCR products were always obtained for genotypes exhibiting haplotype Hap-5B-RDMa-h1, but in the majority of cases (7 out of 9 SNPs) not for genotypes exhibiting haplotype Hap-5B-RDMa-h2. Within the region flanking SNP BobWhite_c43_86, a length polymorphism between haplotypes Hap-5B-RDMa-h1 and -h2 was detected by gel electrophoresis and Sanger sequencing (Figs. 4, 8).
Table 4 Summary of results for PCR amplification for Sanger sequencing of SNP flanking regions using subgenome-specific primer pairs designed based on two different genome assemblies (IWGSC Chinese Spring v0.1 and Paragon)
To explore why Sanger sequencing failed for genotypes exhibiting haplotype Hap-5B-RDMa-h2, seven subgenome-specific primer pairs flanking the SNPs were redesigned using scaffold sequences from accession Paragon (Earlham Inst. v1), which also harbors haplotype 5B-RDMa-h2. PCR amplification products were obtained for all genotypes exhibiting haplotype Hap-5B-RDMa-h2 (Table 4). Based on the Sanger sequencing results together with multiple genomic resources, KASP primers were then designed. This led to a further increase in the number of successfully converted KASP assays from 2 to 5 out of 9 in haploblock Hap-5B-RDMa (Table 3, Supplementary Table S4, example 1 in Supplementary File S2). For the remaining 4 SNPs in haploblock Hap-5B-RDMa, the high subgenome similarity and the polymorphism between reference genome/assemblies around the SNP position prevented conversion into locus-specific KASP assays.
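The cumulative conversion rates over the three approaches can be tallied as in the sketch below. The counts are read from the running text of this section (not copied from Table 3), and the approach labels and variable names are descriptive assumptions introduced only for this illustration.

```python
# Number of SNP probes per haploblock (from the text).
TOTAL_SNPS = {"Hap-5B-RDMa": 9, "Hap-5B-RDMb": 6}

# Cumulative number of SNPs converted into validated KASP assays after each
# approach, as stated in the running text of this section.
CONVERTED = {
    "1: 101 bp flanking sequence only": {"Hap-5B-RDMa": 0, "Hap-5B-RDMb": 3},
    "2: alignment to Chinese Spring reference": {"Hap-5B-RDMa": 2, "Hap-5B-RDMb": 6},
    "3: Sanger sequencing and Paragon scaffolds": {"Hap-5B-RDMa": 5, "Hap-5B-RDMb": 6},
}


def conversion_summary(counts: dict[str, int]) -> tuple[int, int, float]:
    """Return (converted, total, rate) summed over both haploblocks."""
    converted = sum(counts.values())
    total = sum(TOTAL_SNPS.values())
    return converted, total, converted / total


for approach, counts in CONVERTED.items():
    converted, total, rate = conversion_summary(counts)
    print(f"approach {approach}: {converted}/{total} ({rate:.0%})")
```

Under this reading, the overall count rises from 3 to 8 to 11 of 15 SNPs, with the 4 unconverted SNPs all located in haploblock Hap-5B-RDMa.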
Identification of reasons for incorrect cluster calling and strategies for improvement of clarity and locus-specificity
When testing the initially designed KASP primer combinations with a set of 213 reference genotypes with known allele composition, 5 out of 9 assays for haploblock Hap-5B-RDMa and 3 out of 6 for haploblock Hap-5B-RDMb produced unexpected KASP clusters. Sanger sequencing of a set of 80 affected genotypes revealed different reasons why some genotypes were assigned to the wrong clusters (Fig. 5). A case that is often reported in the literature is the typical hemi-SNP shown in Fig. 5a, where all genotypes homozygous for one allele cluster in a ‘heterozygous’ cluster because the common primer binds to more than one subgenome. The other two cases of wrong clustering identified by Sanger sequencing were due to a mismatch at the 3′ end of the common primer site for one allele, a deletion at the 5′ end of the common primer site, or a size difference between the two allelic PCR products (Fig. 5b, c). All these cases result in preferential binding and amplification of one allele relative to the other, and thus in false calling of heterozygous genotypes as homozygous or vice versa.
False SNP call assignments for some heterozygous genotypes are particularly problematic if broadly applicable locus-specific KASP assays need to be developed for MAS. Figure 6a shows an example for KASP assay HapA6-1, initially derived from SNP probe BobWhite_c43_86 in haploblock Hap-5B-RDMa. Some heterozygous F1 genotypes from a cross of parents P1 × P2 cluster falsely (AA, in green) together with the homozygous parent P1 (AA, in blue). Figure 6b shows that, if no heterozygous F1 reference genotypes are available, artificial mixtures of DNA from two divergent homozygous genotypes (P1 and P2) will cluster reliably in the expected heterozygote pattern over a wide range of concentration ratios between 1:9 and 9:1. Figure 6c shows that using artificial heterozygous DNA samples from parents P1 and P2, instead of natural heterozygous plants, also allows identification of false clustering.
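As a practical aid, the following sketch lists pipetting volumes for an artificial heterozygote dilution series between 1:9 and 9:1. It assumes that both parental DNA stocks have been normalized to the same concentration and uses a hypothetical total volume of 10 µl per mixture; neither detail is taken from the study.

```python
def mixture_volumes(parts_p1: int, parts_p2: int, total_ul: float) -> tuple[float, float]:
    """Volumes (in µl) of P1 and P2 DNA for a parts_p1:parts_p2 mixture.

    Assumes both DNA stocks have the same concentration.
    """
    if parts_p1 <= 0 or parts_p2 <= 0:
        raise ValueError("both ratio parts must be positive")
    parts = parts_p1 + parts_p2
    return total_ul * parts_p1 / parts, total_ul * parts_p2 / parts


# Ratio series from 1:9 to 9:1 (P1:P2), hypothetical 10 µl per mixture.
for p1 in range(1, 10):
    p2 = 10 - p1
    vol_p1, vol_p2 = mixture_volumes(p1, p2, total_ul=10.0)
    print(f"{p1}:{p2}  P1 = {vol_p1:.1f} µl, P2 = {vol_p2:.1f} µl")
```

If the stocks differ in concentration, the volumes would need to be scaled accordingly before mixing.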
Optimization of KASP markers associated with root biomass for MAS
In total, 11 KASP assays were derived from the two haploblocks and validated in 213 reference genotypes from major wheat growing regions of the world (Supplementary Figure S2). To test these assays for use in a breeding program, extensive validation in breeding material was carried out to evaluate the robustness and limitations of the assays. The KASP assays were evaluated in a backcrossing program and applied for marker-assisted selection for increased root dry biomass. Of the 11 KASP assays, 3 were predicted to be sufficient to distinguish the haplotype combination Hap-5B-RDMa-h2 and Hap-5B-RDMb-h3 associated with high root biomass from other haplotype combinations associated with low root biomass (HapA6-1 for haploblock Hap-5B-RDMa; HapB3-2 and HapB6-1 for haploblock Hap-5B-RDMb). In the backcrossing program, 2 parents with high root biomass originating from China, 2 parents from the Netherlands with low root biomass and 4 other parents representing adapted cultivars from Australia were used. In total, approximately 400 offspring were tested with the 3 selected KASP assays. Two of the assays produced stable and robust data (HapB3-2, HapB6-1). However, one KASP assay, HapA6-1, produced unexpected clustering results in the F1 and BC1F1 generations, revealing a homozygous cluster that included a high number of genotypes with high root biomass in F1 and BC1F1 (Fig. 7b). Sanger sequencing of the primer target site was performed for the 8 parents. Two genotypes used as parents in the breeding program revealed an 8 bp deletion at the common primer binding site (parental cultivars P2 and P8 in Fig. 8). Optimization and redesign of the KASP assay into the assay HapA6-2 produced the expected clusters (Fig. 7c), and the new assay was successfully adopted in the backcrossing program for high root biomass selection.