A heuristic method for fast and accurate phasing and imputation of single nucleotide polymorphism data in bi-parental plant populations

This paper presents a new heuristic method for phasing and imputation of genomic data in diploid plant species. Our method, called AlphaPlantImpute, explicitly leverages features of plant breeding programs to maximise the accuracy of imputation. The features are a small number of parents, which can be inbred and usually have high-density genomic data, and few recombinations separating parents and focal individuals genotyped at low-density (i.e. descendants that are the imputation targets). AlphaPlantImpute works in roughly three steps. First, it identifies informative low-density genotype markers in parents. Second, it tracks the inheritance of parental alleles and haplotypes to focal individuals at informative markers. Finally, it uses this low-density information as anchor points to impute focal individuals to high-density. We tested the imputation accuracy of AlphaPlantImpute in simulated bi-parental populations across different scenarios. We also compared its accuracy to that of existing software called PlantImpute. In general, the imputation accuracy of AlphaPlantImpute was equal to or better than that of PlantImpute. The computational time and memory requirements of AlphaPlantImpute were tiny compared to those of PlantImpute. For example, accuracy of imputation was 0.96 for a scenario where both parents were inbred and genotyped at 25,000 markers per chromosome and a focal F2 individual was genotyped with 50 markers per chromosome. The maximum memory requirement for this scenario was 0.08 GB and the run took 37 seconds to complete.

heterozygous for that marker).

Step 4: Impute focal individual to high-density using anchors from step 3

[...] focal individual ID_Y is imputed as genotype 0.

For markers on the HD array, assign parent-of-origin to marker alleles based on the parent-of-origin assignment of the two nearest marker alleles on the LD array. [...] second haplotype of focal individual ID_Y to Parent_B for both markers 5 and 7. We therefore also assign marker 6 to Parent_B for the second haplotype. We have assigned the first haplotype of focal individual ID_Y to Parent_A for marker 5 and to Parent_B for marker 7. We conclude that there was a potential recombination around marker 6 at the first haplotype and we do not assign parent-of-origin for this allele.

For HD markers with assigned parent-of-origin in step 4b, we phase the allele inherited from that parent for the haplotype of the focal individual. If we have phased both alleles at a marker, we impute the genotype as the sum of the two alleles on the two haplotypes of the focal individual. If parent-of-origin has not been assigned for one or both alleles of the focal individual, we leave the genotype as missing.
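The anchor-based assignment in step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the AlphaPlantImpute implementation: the function names, the "A"/"B" labels, the marker positions and the parental alleles are all hypothetical, both parents are assumed to be fully inbred (so each parent carries a single known allele per marker), and HD markers lying beyond the first or last LD anchor simply take the assignment of the single nearest anchor, a case the description above does not specify.

```python
from bisect import bisect_left, bisect_right


def assign_parent_of_origin(hd_positions, ld_positions, ld_origin):
    """Assign parent-of-origin to HD markers on one haplotype.

    ld_positions must be sorted; ld_origin[i] is "A", "B" or None for anchor i.
    An HD marker is assigned only when its two flanking anchors agree.
    """
    n = len(ld_positions)
    assigned = []
    for pos in hd_positions:
        left = bisect_right(ld_positions, pos) - 1  # last anchor at or before pos
        right = bisect_left(ld_positions, pos)      # first anchor at or after pos
        left_origin = ld_origin[left] if left >= 0 else None
        right_origin = ld_origin[right] if right < n else None
        # Assumption: beyond the outermost anchor, use the single nearest anchor.
        if left < 0:
            left_origin = right_origin
        if right >= n:
            right_origin = left_origin
        agree = left_origin is not None and left_origin == right_origin
        assigned.append(left_origin if agree else None)
    return assigned


def phase_and_impute(origin_hap1, origin_hap2, parent_alleles):
    """Phase both haplotypes from parent-of-origin and sum them into genotypes.

    parent_alleles maps "A"/"B" to that (inbred) parent's allele at each HD marker.
    Markers with an unassigned allele keep a missing (None) genotype.
    """
    def phase(origins):
        return [None if o is None else parent_alleles[o][i]
                for i, o in enumerate(origins)]

    hap1, hap2 = phase(origin_hap1), phase(origin_hap2)
    genotypes = [None if a is None or b is None else a + b
                 for a, b in zip(hap1, hap2)]
    return hap1, hap2, genotypes


# Worked example mirroring markers 5, 6 and 7 of focal individual ID_Y.
hd_positions = [5, 6, 7]
ld_positions = [5, 7]
origin_hap1 = assign_parent_of_origin(hd_positions, ld_positions, ["A", "B"])
origin_hap2 = assign_parent_of_origin(hd_positions, ld_positions, ["B", "B"])
parent_alleles = {"A": [0, 0, 0], "B": [1, 1, 1]}  # hypothetical alleles
hap1, hap2, genotypes = phase_and_impute(origin_hap1, origin_hap2, parent_alleles)
print(origin_hap1, origin_hap2, genotypes)
```

In this example marker 6 stays unassigned on the first haplotype because its flanking anchors disagree, so its genotype is left missing, while the second haplotype is assigned to Parent_B throughout.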

Step 5: Impute markers in recombined regions

We phase and impute missing HD markers in potentially recombined regions in one of two ways. We either (1) impute expected genotype dosage as the average of the alleles of the two parents; or (2) phase and impute using information from a genetic or physical map. For (2), we first identify the two closest neighbouring markers that were informative and phased, second use the distance between these two markers as a weight to phase the missing alleles as the weighted average of the alleles of the two parent haplotypes, and third impute expected genotype dosage as in (1).
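The two options in step 5 can be illustrated with a short Python sketch. It reflects one plausible reading of the description and is not taken from the AlphaPlantImpute source: option (1) takes the expected allele on an unassigned haplotype to be the mean of the two parental alleles, and option (2) is written as a linear interpolation in which the parent assigned at the nearer flanking anchor receives the larger weight. The linear form of the weights, the function names and the example values are assumptions; the genotype dosage is taken as the sum of the expected alleles on the two haplotypes.

```python
def expected_allele_average(allele_parent_a, allele_parent_b):
    """Option (1): unweighted mean of the two parental alleles."""
    return 0.5 * (allele_parent_a + allele_parent_b)


def expected_allele_weighted(pos, left_pos, right_pos, left_allele, right_allele):
    """Option (2): distance-weighted mean between two phased anchors.

    left_allele / right_allele are the alleles that the parent assigned at the
    left / right anchor carries at the missing marker. Positions come from a
    genetic or physical map and must use the same unit.
    """
    span = right_pos - left_pos
    if span <= 0:
        # Degenerate interval: fall back to the unweighted mean of option (1).
        return expected_allele_average(left_allele, right_allele)
    weight_right = (pos - left_pos) / span  # grows as the marker nears the right anchor
    return (1.0 - weight_right) * left_allele + weight_right * right_allele


def expected_dosage(allele_hap1, allele_hap2):
    """Expected genotype dosage: sum of the (expected) alleles on both haplotypes."""
    return allele_hap1 + allele_hap2


# Hypothetical example: a missing marker at map position 5.5 between anchors at
# 5.0 (assigned to Parent_A, allele 0) and 7.0 (assigned to Parent_B, allele 1);
# the other haplotype is already phased as allele 1.
unweighted = expected_allele_average(0, 1)
weighted = expected_allele_weighted(5.5, 5.0, 7.0, left_allele=0, right_allele=1)
dosage = expected_dosage(weighted, 1)
print(unweighted, weighted, dosage)
```

Because the missing marker sits closer to the Parent_A anchor, the weighted expected allele (0.25) is pulled towards the Parent_A allele, whereas the unweighted option gives 0.5 regardless of position.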

Implementation

We have implemented the method in a program called AlphaPlantImpute, [...] on whether a marker is informative. AlphaPlantImpute implements some data editing checks, which are described in the user manual.

To test the imputation accuracy of AlphaPlantImpute, testing datasets of a [...] in Ne in a crop such as maize (Zea mays L.). These set points were: 100 in the base [...] approximately 80,000 segregating sites in total.

Simulation of a pedigree

[...] parental populations were selfed to F1, F2, F4, F10, or F20, resulting in different levels [...] were selfed to generate 100 F2 individuals. F2 individuals were selfed to generate 100 [...]

Increasing the number of LD markers increases the imputation accuracy of AlphaPlantImpute. Figure 2 plots the number of LD markers against the accuracy of imputation for F2 focal individuals of an F20 x F20 bi-parental cross. Figure 2 shows [...]

Increasing the number of selfing events separating parents and focal individuals slightly decreases the imputation accuracy of AlphaPlantImpute. Figure 3a plots the accuracy of imputation in F2, F4, F6 and F10 focal individuals of a bi-parental population where the parents were F20. Figure 3a shows that with 3 LD [...] individuals and F10 focal individuals. Figure 3a shows that with 20 LD markers, the [...] higher for AlphaPlantImpute than for PlantImpute and vice versa. Figure 3b shows [...] AlphaPlantImpute and 0.76 for PlantImpute for F2 focal individuals and was 0.77 for AlphaPlantImpute and 0.70 for PlantImpute for F10 focal individuals.

For all numbers of selfing events separating parents and focal individuals, increasing the number of LD markers reduced and in some cases reversed the advantage of AlphaPlantImpute over PlantImpute. This was most obvious for F10 [...] with PlantImpute was slightly higher than with AlphaPlantImpute. Figure 3b shows [...] imputation for AlphaPlantImpute equalled that for PlantImpute. Figure 3b shows that with 100 LD markers, the average imputation accuracy was 0.99 for both AlphaPlantImpute and PlantImpute for F2 focal individuals and for F10 focal individuals.

For all numbers of selfing events separating parents and focal individuals, the precision of imputation accuracy (i.e., consistency across focal individuals) for AlphaPlantImpute was higher than for PlantImpute when the number of LD markers was low. Figure 3c is similar to Figure 3b and plots the log of the precision of [...] AlphaPlantImpute than for PlantImpute, and vice versa. Figure 3c shows that with 3 LD markers, the precision of imputation was 1.62 for AlphaPlantImpute and 1.08 for PlantImpute for F2 focal individuals and was 1.32 for AlphaPlantImpute and 1.11 for PlantImpute for F10 focal individuals. [...] imputation accuracy for AlphaPlantImpute was higher than for PlantImpute for F2 [...] AlphaPlantImpute and PlantImpute for F10 focal individuals.

Increasing the level of inbreeding in the parents increases the imputation accuracy for AlphaPlantImpute. Figure 4a plots the accuracy of imputation in F2 focal individuals of a bi-parental population where the parents were F1, F2, F4, F10 or F20. Figure 4a shows that with 20 LD markers, the average imputation accuracy increased [...]

For all levels of inbreeding in the parents and all numbers of LD markers, the average imputation accuracy with AlphaPlantImpute was almost always higher than with PlantImpute. Figure 4b is similar to Figure 3b and plots the average imputation [...] shapes represent the level of inbreeding in the parents. Figure 4b shows that with 20 SNP LD markers, the average imputation accuracy was 0.81 for AlphaPlantImpute and 0.74 for PlantImpute for F2 focal individuals when parents were F1, 0.95 for AlphaPlantImpute and 0.91 for PlantImpute when parents were F4, and 0.96 for AlphaPlantImpute and 0.94 for PlantImpute when parents were F10. In two cases, the average imputation accuracy with PlantImpute was slightly higher than with AlphaPlantImpute. This was when parents were F4 and with 3 and 5 LD markers. The [...] axis. Figure 4c shows that with 20 LD markers, the precision of imputation accuracy [...] for AlphaPlantImpute and 4.00 for PlantImpute with 400 LD markers.

[...] population where the parents were F20. Figure 5a shows that with 3 LD markers, [...] imputation accuracy was less or non-existent when the number of LD markers was higher than 10. Figure 5a shows that the imputation accuracy was approximately 0.98 for all chromosome sizes when the number of LD markers was 50.

When the chromosome size was 300 cM or less, the average imputation [...] imputation accuracy for AlphaPlantImpute was generally higher than for PlantImpute. [...] AlphaPlantImpute on the y-axis and for PlantImpute on the x-axis. Figure 5c shows [...] apparent when the number of LD markers was low.

Figure 6 plots the accuracy of [...] 50 or 100 focal individuals. Figure 6 shows that increasing the number of focal individuals from 5 to 100 increased the average imputation accuracy from 0.83 to 0.85 when 3 LD markers were used. Figure 6 also shows that when 10 or more LD markers were used, increasing the number of focal individuals had no effect on the [...]

[...] twelve datasets across the three scenarios. Datasets were chosen to reflect the [...]

PlantImpute marginalizes over all possible phases and genotypes, which is probabilistically correct and handles the uncertainty properly, but this seems to lower the imputation accuracy. One exception to this was when the chromosome [...] PlantImpute than for AlphaPlantImpute for all LD arrays.

The biggest advantage of AlphaPlantImpute compared to PlantImpute relates [...]

Finally, although SNP arrays exist for many domesticated plant species, low-coverage sequencing methods such as genotyping-by-sequencing are also used.

The heuristics of AlphaPlantImpute might be extended to enable imputation with such data.

Software availability

We implemented our method in a software package called AlphaPlantImpute, available at http://www.AlphaGenes.roslin.ed.ac.uk/AlphaPlantImpute/ along with a user manual.

The number of SNP on the LD panel against the genotype imputation accuracy using AlphaPlantImpute for F2 focal individuals of a bi-parental cross where the parents are F20 inbred individuals.

(a) The genotype imputation accuracy using AlphaPlantImpute in F2 focal individuals of a bi-parental cross where the parents are F1, F2, F4, F10 or F20.
(b) Comparison of the average genotype imputation accuracy using AlphaPlantImpute (y-axis) vs. using PlantImpute (x-axis). The colours represent the different LD panels. The shapes represent the level of inbreeding in the parents. The red diagonal line indicates when the accuracy of PlantImpute equals AlphaPlantImpute. Points above the line are when imputation accuracy is higher with AlphaPlantImpute and points below the line are when imputation accuracy is higher with PlantImpute.
(c) Comparison of the precision in imputation accuracy using AlphaPlantImpute (y-axis) vs. using PlantImpute (x-axis). The colours represent the different LD panels. The shapes represent the level of inbreeding in the parents. The red diagonal line indicates when the precision of PlantImpute equals AlphaPlantImpute. Points above the line indicate when the precision in accuracies is higher in AlphaPlantImpute and points below the line are when the precision in accuracies is higher in PlantImpute.

Figure 5 - Effect of chromosome size.
(a) The genotype imputation accuracy using AlphaPlantImpute in F2 focal individuals from a bi-parental cross of F20 parents against seven chromosome sizes of 25, 50, 100, 150, 200, 300, and 400 cM.
(b) Comparison of the average genotype imputation accuracy using AlphaPlantImpute (y-axis) vs. using PlantImpute (x-axis). The colours represent the different LD panels. The shapes represent the chromosome size. The red diagonal line indicates when the accuracy of PlantImpute equals AlphaPlantImpute. Points above the line are when imputation accuracy is higher with AlphaPlantImpute and points below the line are when imputation accuracy is higher with PlantImpute.
(c) Comparison of the precision in imputation accuracy using AlphaPlantImpute (y-axis) vs. using PlantImpute (x-axis). The colours represent the different LD panels. The shapes represent the chromosome size. The red diagonal line indicates when the precision of PlantImpute equals AlphaPlantImpute. Points above the line indicate when the precision in accuracies is higher in AlphaPlantImpute and points below the line are when the precision in accuracies is higher in PlantImpute.

Figure 6 - Effect of the number of focal individuals in the bi-parental population.
The number of focal individuals in the bi-parental population against the genotype imputation accuracy using AlphaPlantImpute for F2 focal individuals of a bi-parental cross where the parents are F20 inbred individuals.