Phasing of 2-SNP Genotypes Based on Non-random Mating Model
Emerging microarray technologies allow genotyping of long genome sequences, resulting in huge amounts of data. A key challenge is to provide accurate phasing of very long single nucleotide polymorphism (SNP) sequences. In this paper we explore phasing of genotypes with 2 SNPs adjusted to the non-random mating model and then apply it to the haplotype inference of complete genotypes using maximum spanning trees. The runtime of the algorithm is O(nm(n+m)), where n and m are the numbers of genotypes and SNPs, respectively. The proposed phasing algorithm (2SNP) can be used for comparatively accurate phasing of large numbers of very long genome sequences. On datasets across 79 regions from HapMap, 2SNP is several orders of magnitude faster than GERBIL and PHASE while matching them in quality, as measured by the number of correctly phased genotypes and by single-site and switching errors. For example, 2SNP requires 41 s on a 2 GHz Pentium 4 processor to phase 30 genotypes with 1381 SNPs (ENm010.7p15:2 data from HapMap), whereas GERBIL and PHASE require more than a week of runtime and make no fewer errors than 2SNP. The 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.
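The way pairwise phasing decisions can be combined through a maximum spanning tree may be illustrated with a minimal Python sketch. It is not the 2SNP implementation: the genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous sites), the function name phase_genotype, and the pair_scores dictionary are all assumptions made for illustration. In particular, the scores are taken as given (positive favors the cis phasing 00/11, negative favors the trans phasing 01/10, magnitude is confidence); how such scores would be derived from the non-random mating model is not shown here.

```python
def phase_genotype(genotype, pair_scores):
    """Phase one genotype by growing a maximum spanning tree (Prim's method)
    over its heterozygous sites.

    genotype    : list of 0/1/2, where 2 marks a heterozygous site (assumed encoding)
    pair_scores : dict {(i, j): score} with i < j (hypothetical input);
                  score > 0 favors cis, score < 0 favors trans, |score| = confidence
    """
    def score(i, j):
        # Missing pairs are treated as carrying no information.
        return pair_scores.get((min(i, j), max(i, j)), 0.0)

    het = [i for i, g in enumerate(genotype) if g == 2]
    h1 = [g if g != 2 else None for g in genotype]

    if het:
        root = het[0]
        h1[root] = 0  # arbitrary orientation of the first heterozygous site
        # best[j] = (weight of the strongest edge into the tree, tree endpoint)
        best = {j: (abs(score(root, j)), root) for j in het[1:]}
        while best:
            j = max(best, key=lambda k: best[k][0])
            _, parent = best.pop(j)
            # Propagate the phase along the chosen tree edge.
            h1[j] = h1[parent] if score(parent, j) >= 0 else 1 - h1[parent]
            for k in best:
                w = abs(score(j, k))
                if w > best[k][0]:
                    best[k] = (w, j)

    # The second haplotype is the complement at heterozygous sites.
    h2 = [a if g != 2 else 1 - a for a, g in zip(h1, genotype)]
    return h1, h2


# Toy input: sites 0, 2 and 3 are heterozygous; the weak (0, 2) score
# disagrees with the two strong scores and is left out of the tree.
toy_genotype = [2, 0, 2, 2, 1]
toy_scores = {(0, 2): -0.2, (0, 3): -3.0, (2, 3): -2.0}
hap1, hap2 = phase_genotype(toy_genotype, toy_scores)
print(hap1, hap2)  # [0, 0, 0, 1, 1] [1, 0, 1, 0, 1]
```

In this toy case the tree keeps the two high-confidence pairs (0, 3) and (2, 3), so the low-confidence and conflicting pair (0, 2) does not influence the result. This simple version evaluates all pairs of heterozygous sites within a genotype and makes no attempt to match the stated O(nm(n+m)) bound.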