The completion of the Human Genome Project marks a significant milestone in genetic research, ushering in an era of research opportunities in the application of genomic technologies to medical and public health problems [1-3]. One area of application involves the identification and characterisation of DNA sequence variation and its relationship (or association) with, for example, disease susceptibility. Many initiatives have been put in place to facilitate relevant association studies, but the most important is the International HapMap Project (IHP) [4]. The assignment and analysis of haplotype frequencies (ie the number of times alleles at different loci are observed together on the same chromosome in a sample of individuals) can not only lead to estimates of linkage disequilibrium (LD) strength, but can also be used as the basis for a number of additional analyses -- such as the comparison of population genetic structures (eg immigration rates, genetic distances, etc), the consideration of chromosome phylogeny and the estimation of the age of mutations [5-15]. Moreover, the use of haplotypes may result in considerable savings in genotyping costs and improvements in the power of an association study [16-18].

Unfortunately, many current genotyping technologies are unable to resolve the phase of maternal and paternal chromosomes in unrelated individuals, and hence the actual haplotypes an individual possesses may be in doubt. This ambiguity is referred to as the 'haplotype problem', and its complexity increases exponentially with the number of loci being studied. Although there are technologies that can be used to unambiguously resolve phase at the chromosome or DNA level, they tend to be cost prohibitive [19-24]. Haplotype analysis involving related individuals (individuals collected from families and/or pedigrees) potentially offers more information and certain advantages compared with analysis involving unrelated individuals. Family-based analysis, however, imposes additional challenges and may not be suitable for all study designs or research objectives [5, 25-27]. A companion review that focuses on computer programs and issues related to haplotype analyses involving related individuals will follow [28]. For unrelated individuals, statistical procedures are therefore required both to estimate haplotype frequencies and to assign the most likely haplotypes from genotype data [23, 29, 30]. In this paper, available computer programs for haplotype frequency estimation and for the assignment of haplotypes to unrelated individuals are considered. The paper builds on an earlier review,[31] recent discussions of relevant algorithms [32, 33] and articles comparing different procedures [30, 34-36]. Some simple recommendations are made for addressing specific research questions using available software. Finally, web-based summaries of these evaluations are available and provide greater detail than that outlined here (URL:

Materials and methods

Identification of software

Available software was identified through four means:

1) searching PUBMED through to June 2004; 2) reviewing cited references of retrieved papers and reviews of papers; 3) internet searches (eg via Google); and 4) communication with investigators working in the field. The PUBMED and/or internet searches included the following terms or combinations of terms: 'haplotyping', 'haplotype', 'analysis', 'methods', 'software', 'inference', 'assignment', 'problem', 'unrelated', 'population' and 'pooled'.

The methods, features and limitations of the identified programs were evaluated using the original published articles describing the methods, the manuals associated with the software and articles comparing programs and methodologies. The assessments provided here build on an earlier review,[31] published discussions of algorithms for haplotype analysis [32, 33] and articles contrasting different methodologies [30, 34-36]. Accuracy of the methods used for estimating haplotype frequencies and assigning haplotypes to individuals was considered to be of particular importance. Ideally, an indirect (ie statistically-based) haplotyping method should be validated by comparison with direct, DNA sequence-derived haplotype information. Although studies with simulated data are also informative, allowing discrimination of program performance under a variety of situations, without a 'gold standard' for comparison purposes it is hard to assess the true reliability of a method. The large number of reviewed programs precludes systematic testing of the identified programs' accuracy, performance and claims. The evaluation of this large group of programs is complicated by the diversity of methods used, the differing measures of reliability, the varying datasets and assumptions, and program characteristics which limit or prevent a program from working in all instances. The authors have endeavoured to provide a thorough review of the literature of haplotyping software in unrelated individuals, but it is acknowledged that not all original authors' claims have been validated (Supplemental Table S-A provides a brief summary of reviewed articles in which programs were actually compared). Thus, there is a reliance on some authors' claims that have not been independently verified. The majority of identified programs are freely available to academic and non-profit users. Finally, recommendations are provided for specific research objectives.

Evaluation criteria

The identified computer programs were evaluated on the basis of a number of criteria and/or software features. Many of these features and criteria were considered because they reflect items that should guide the use of particular haplotyping software.

  1. Algorithms and methods: the analytical methods and algorithms implemented in the available programs are considered. Essentially, algorithms can be divided into two broad classes: parsimony and likelihood methods.

  2. Accuracy: the accuracy of haplotyping algorithms is considered in terms of the algorithm's ability to assess haplotype frequencies from a sample of unrelated individuals, as well as to assign haplotypes to particular individuals. Measures of accuracy are discussed briefly in the accuracy section and are detailed on the above-mentioned website (see supplementary Table S-B).

  3. Assumptions: haplotyping programs often make assumptions about, for example, Hardy-Weinberg equilibrium (HWE), LD, population history and recombination. These assumptions can have an impact on the accuracy of haplotype frequency estimates and assignments.

  4. Genotyping error: the accommodation of genotyping error in haplotype inference is considered. Programs that identify and accommodate genotyping errors are noted.

  5. Hypothesis testing: not all programs have the ability to conduct statistical tests of hypotheses, so this feature is considered as well.

  6. Missing data: the accommodation of missing data in haplotype analysis is considered.

  7. Software characteristics: issues related to the usability of programs are considered, including computer system requirements, input data formats, interfaces, output, run time and sample size.

  8. Web implementation: web-based implementations of available computer programs are considered.


Forty-six haplotyping programs were identified and reviewed. The programs were divided into two groups: those designed for analyses involving individual genotype data from unrelated individuals (a total of 43 programs) and those designed for analysis of DNA pools (a total of three programs). An overview of reviewed programs is presented in Tables 1, 2, 3, 4 and in Supplemental Tables S1-S4, S-A and S-B. Additional information on the software programs discussed in this paper, links and contact information for programs, all supplemental tables, updates to existing software and newly released software are available at the following website:

Table 1 Description of unrelated haplotyping programs, divided into four classes based on method.
Table 2 Web-based haplotyping programs and related websites.
Table 3 Haplotyping software for hypothesis testing and analysis.
Table 4 Description of programs designed for pooled samples.

The majority of identified programs for estimating haplotype frequencies and assigning them to individuals use methods rooted in likelihood theory (eg for estimation purposes -- primarily the maximum likelihood approach). From a survey of the literature, it appears that most of the programs give similar results, although performance is not always consistent. No group or individual program appears to work well in all situations, or have all the features one might like to see implemented in a haplotype analysis program. It appears that accuracy and performance are affected by the characteristics of the data to be analysed and the characteristics of the population from which the individuals are sampled.

Haplotyping in unrelated populations

Algorithms and methods

A number of different analytical methods have been proposed for haplotype analysis involving unrelated individuals (see Table 1 and Supplemental Table S1). Ultimate classification of haplotyping algorithms is difficult, since implemented algorithms are often modified and combined in programs. A broad classification can be made, however, between algorithms based on parsimony and algorithms based on likelihood theory. An overview of each of these classes is provided below.

Methods based on parsimony: in 1990, Clark [37] proposed an innovative method of constructing haplotypes using a rule-based algorithm. This simple method uses the haplotypes of individuals whose phase is known with certainty (eg individuals homozygous at every locus) to draw inferences about the most likely haplotypes for individuals whose haplotypes are ambiguous, given their genotype data. HAPINFREX, which employs Clark's method, is computationally fast and efficient and has been used in a great deal of research [14, 38]. Limitations of the method include the requirement of unambiguous individuals in the study population, sensitivity to the order in which data are analysed, the inability to assign haplotypes to all individuals and potentially erroneous haplotype assignments [37, 39]. To overcome these limitations, a pure parsimony extension, using integer linear programming, has been proposed [40, 41] and implemented in the program HAPAR [39]. Further extensions of parsimony methods take advantage of the 'perfect phylogeny framework' [40]. These programs apply the results of recent research indicating that recombination is uncommon within LD blocks [16-18] for efficient and effective haplotype analysis. Perfect phylogeny haplotyping (PPH) reduces the haplotype analysis problem to a phylogeny problem [40] by assuming no recombination and infinite-sites mutation. Within this framework, unphased genotype data are reduced to a 'graph realisation problem' and solved using matroid theory and graph analysis in GPPH, although a unique solution is not guaranteed [40, 42]. A simpler alternative method based on graph analysis is employed by DPPH [43]. Since empirical data may violate the perfect phylogeny assumption,[44] the assumption is relaxed in the 'imperfect phylogeny' model implemented in HAPH [44] and BPPH [45]. HAPH constructs haplotypes within LD blocks using a maximum likelihood method.
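The rule-based inference attributed to Clark above can be sketched in a few lines of Python. This is an illustrative sketch only, not the HAPINFREX implementation: the genotype coding (0 or 1 for the two homozygotes, 2 for a heterozygote), the function names and the choice to treat individuals with at most one heterozygous locus as unambiguous are all assumptions made for the example.

```python
def is_unambiguous(genotype):
    """Phase is certain when at most one locus is heterozygous (coded 2)."""
    return sum(1 for g in genotype if g == 2) <= 1


def resolve_unambiguous(genotype):
    """Return the haplotype pair of an unambiguous genotype."""
    h1 = tuple(0 if g == 2 else g for g in genotype)
    h2 = tuple(1 if g == 2 else g for g in genotype)
    return h1, h2


def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    other = []
    for g, a in zip(genotype, hap):
        if g == 2:
            other.append(1 - a)      # heterozygous site: partner carries the other allele
        elif g == a:
            other.append(a)          # homozygous site: both haplotypes agree
        else:
            return None              # `hap` cannot be part of this genotype
    return tuple(other)


def clark_inference(genotypes):
    """Greedy rule-based phasing; returns (resolved pairs by index, unresolved indices)."""
    known = []                       # haplotypes in order of discovery
    resolved = {}
    for i, g in enumerate(genotypes):
        if is_unambiguous(g):
            resolved[i] = resolve_unambiguous(g)
            for h in resolved[i]:
                if h not in known:
                    known.append(h)
    changed = True
    while changed:                   # repeat until no further genotype can be resolved
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    changed = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved
```

Because the first compatible known haplotype is always accepted, the result depends on the order of both the individuals and the discovered haplotypes, and a sample with no unambiguous individuals leaves every genotype unresolved. Both behaviours correspond to the limitations noted above.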

Methods based on likelihood theory: the majority of programs that could be located are rooted in likelihood theory. Methods that exploit likelihood theory can be further broken down into maximum likelihood and Bayesian methods. The expectation maximisation (EM) algorithm is the most widely used haplotyping algorithm based on likelihood theory. In 1995, three research groups separately implemented and published EM-based haplotyping programs: 3locus.PAS,[46] HAPLOH [47] and MLHAPFRE [48]. Excoffier and Slatkin [48] present a discussion of the challenges and limitations of applying the EM algorithm to haplotype analysis. In brief, the EM method alternates between two steps: an expectation step, which uses the current haplotype frequency estimates (starting from initial parameter inputs) to compute the posterior probabilities of the haplotype pairs compatible with each observed genotype, and a maximisation step, which updates the frequency estimates from those probabilities. The estimates are iteratively updated in this way to maximise the likelihood function.
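To make the two steps concrete, the following is a minimal sketch of an EM haplotype frequency estimator for biallelic loci. It is not the code of any of the programs named above. The genotype coding (0 or 1 for homozygotes, 2 for a heterozygote), the uniform starting frequencies, the function names and the stopping rule are assumptions made for illustration, and HWE is assumed when weighting haplotype pairs.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    """Estimate haplotype frequencies by EM, assuming HWE and complete data."""
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pair_sets for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}      # assumed uniform start
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pair_sets:
            # E-step: posterior probability of each compatible pair under HWE
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in ps]
            total = sum(weights)
            for (a, b), w in zip(ps, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts become the new frequency estimates
        new_freq = {h: counts[h] / n_chrom for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq
```

A genotype with k heterozygous loci has 2^(k-1) compatible pairs, so the enumeration in `compatible_pairs` grows exponentially with the number of loci. A single run from one starting point may also stop at a local maximum; rerunning from several starting values is one way to check for this.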

The EM algorithm has been shown to be accurate via simulations,[49] and produces haplotype frequency estimates comparable to molecular haplotype frequencies [23, 29, 30]. Moreover, much of the error in haplotype frequency estimation associated with the EM algorithm has been found to be due to sampling error [29, 40]. The EM algorithm may occasionally miscall rare or low frequency haplotypes [29, 30, 49, 50]. Accuracy of the EM algorithm improves with increasing sample size [49]. The EM algorithm does have some limitations: it may converge to a non-global maximum, requiring restarts to ensure that a global maximum is reached,[48, 49] and its memory demands may limit its utility with large numbers of subjects and large datasets [48, 51].

Variants of the EM algorithm have been developed to overcome some of these constraints. The SNPHAP program handles the limitations by progressively expanding the subsets of markers and eliminating low frequency haplotypes from consideration at each step (referred to as posterior and prior 'trimming') [52]. The THESIAS program uses a stochastic variant of the EM algorithm to overcome many of its limitations [53]. Alternatively, the PL-EM program combines a partition-ligation (PL) strategy with the EM algorithm to allow haplotyping of hundreds of loci [54-56]. The HPLUS program combines the EM likelihood function with an estimating equation and the PL model to efficiently handle construction of large haplotypes with missing data [55].

The second class of likelihood algorithms is based on Bayesian estimators and Bayesian-based numerical strategies, such as Gibbs sampling [51, 57-61]. Bayesian methods use different models or prior assumptions to model haplotype frequencies, and as such can be tailored to different settings, thereby improving their accuracy. Bayesian haplotype analysis methods can be further subdivided into 'simple' and 'coalescent-based' methods. The simple methods make no assumption about the history of the populations from which samples of individuals have been drawn. Simple Bayesian programs include HAPLOTYPER and HAPLOREC. HAPLOTYPER uses a statistical method similar to EM [57]. HAPLOREC implements a Bayesian method using a variable-length Markov chain approach [62]. The coalescent-based Bayesian methods essentially take similarities between and among haplotypes into account. This class includes the widely-used program, PHASE. The latest version of PHASE (v2.0) incorporates an updated algorithm to improve accuracy and the PL algorithm to improve performance time [59]. A modified model, the neutral coalescent model, is implemented in SLHAP v1.0 [58]. SLHAP v1.0 builds on PHASE v1.0 to include modifications to improve computation time and to accommodate missing data [58]. Finally, Arlequin (version 3.0) draws on the coalescent model, exploiting a relaxed definition for similar haplotypes in an adaptive window approach [60].

Accuracy

The accuracy of available programs was assessed through consideration of published articles investigating haplotype frequency estimation and assignment accuracy, including comparisons to molecular and simulated haplotype data. Measuring the accuracy of a haplotyping method requires a comparison of the inferred haplotype assignments and/or frequency estimates with the true (expected) haplotypes. The 'gold standard' for comparison is DNA sequence-derived haplotype information. The advantage of using accurate molecular haplotype data is that no assumptions, such as those guiding simulations, need to be specified, so the measured accuracy of a specific program is not influenced or biased by assumptions imposed in simulated data. Simulated data, in turn, facilitate additional testing, including the discrimination of program performance under a variety of situations and assumptions.

Comparison of accuracy between haplotyping programs is a taxing venture, complicated by a variety of issues. A significant challenge is that most programs have not been directly compared with each other (Supplemental Table S-A provides a brief overview of retrieved articles that compared accuracy and performance of programs). Only a small set of programs are compared in each individual paper. Comparison of accuracy and performance of these select programs is often carried out with different datasets and under varying conditions.

A further challenge is that numerous measures have been used to assess accuracy, and these vary across publications. In brief, several measures of the global accuracy of frequency estimates/assignments were found in the reviewed literature: discrepancy, error rate, mean square error (MSE), similarity index IF and similarity index IS. In addition, several measures compare the similarity of incorrect haplotype assignments to the true haplotypes: Hamming distance 'error rate H', similarity index IG, single site error rate and switch accuracy (see Supplemental Table S-B for detailed accuracy definitions). Divergent results may be attributable to the method of accuracy measurement. Unfortunately, a comparison of the different accuracy measures was not identified in the reviewed literature.
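To show how such measures can rank the same result differently, the sketch below computes four of them for haplotypes coded as tuples of 0/1 alleles. The definitions used here are common textbook forms and are assumptions for illustration; the exact definitions in the reviewed papers (Supplemental Table S-B) may differ. All function names are hypothetical.

```python
def hamming_distance(h1, h2):
    """Number of sites at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def switch_error_rate(true_pair, inferred_pair):
    """Fraction of adjacent heterozygous-site intervals whose phase is switched."""
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    if len(het) < 2:
        return 0.0                       # phase is unambiguous, nothing to switch
    match = [i1[k] == t1[k] for k in het]
    switches = sum(match[j] != match[j + 1] for j in range(len(match) - 1))
    return switches / (len(het) - 1)


def individual_error_rate(true_pairs, inferred_pairs):
    """Proportion of individuals whose inferred haplotype pair is not exactly right."""
    wrong = sum(
        sorted(t) != sorted(i) for t, i in zip(true_pairs, inferred_pairs)
    )
    return wrong / len(true_pairs)


def frequency_mse(true_freq, est_freq):
    """Mean squared difference between true and estimated haplotype frequencies."""
    haps = set(true_freq) | set(est_freq)
    return sum(
        (true_freq.get(h, 0.0) - est_freq.get(h, 0.0)) ** 2 for h in haps
    ) / len(haps)
```

An individual whose inferred pair is wrong counts as one full error under `individual_error_rate`, even if a single switch would correct it, whereas `switch_error_rate` and `hamming_distance` give partial credit. This is one reason why the choice of measure can change the apparent ranking of programs. Switch accuracy, as named above, would be one minus the switch error rate under this definition.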

To illustrate this, a relatively simple example of four articles that all focus on comparing the PHASE (v1.0) program to EM-based programs is provided here. An original publication describing PHASE (v1.0) reported that the program out-performed other haplotyping methods, reducing MSE rates by more than 50 per cent relative to the HAPINFREX program and a program with a standard EM algorithm [51]. A subsequent comparison [35] between PHASE v1.0 and a standard EM program, with accuracy measured by discrepancy error rates, showed that average error rates did not differ statistically between EM-based methods and PHASE v1.0. This finding was seen across simulated and phase-known data [35]. In rebuttal, Stephens et al. [63] showed that PHASE v1.0 outperforms HAPLOTYPER and PL-EM, with lower error rates on data simulated to fit a coalescent model. The results were reversed when a dataset of molecular haplotypes was used, where HAPLOTYPER and PL-EM were comparable, with both outperforming PHASE v1.0 [57].

As this example demonstrates, characteristics inherent to a specific dataset, whether molecular or simulated, influence the measured performance and accuracy of a program, and hence how a haplotyping program is perceived. Moreover, the studies did not compare identical sets of programs. Both Stephens et al. [51] and Zhang et al. [35] employed their own standard versions of the EM algorithm, which should be comparable but may not have identical specifications. A further challenge is that, while PL-EM is an EM-based program, it is one of several EM programs that have been modified to overcome performance problems of the EM algorithm, as discussed previously. Therefore, the improvement in the performance of the EM-based program, PL-EM, versus PHASE may not necessarily be generalisable to all EM-based programs. To overcome these problems, Stephens et al. [59] compared their updated version of PHASE (v2.0) with several programs, using the same datasets and measures of accuracy as published comparisons of PHASE v1.0 to other programs [57, 58].

Overall, programs based on Bayesian principles, the EM algorithm and imperfect phylogeny performed similarly with sequence-derived and simulated haplotype data. As shown previously,[31] no program or algorithm clearly distinguished itself from the rest. While Clark's intuitive method has shown utility, the present assessment of the literature suggests that other methods offer distinct advantages. The performance of all programs is affected by model assumptions and population genetic parameters. The impact of these assumptions is discussed below.

Assumptions

This section focuses on several common assumptions incorporated in haplotyping programs. Departures from or violations of these assumptions may affect program accuracy and performance. The assumptions are related to each other; violation of one assumption may lead to violation of a second. For ease of evaluation and discussion, each assumption is addressed separately. Program assumptions (HWE, LD, population history, etc) are noted in Tables 1 and S1.

Hardy-Weinberg equilibrium: as described in Tables 1 and 4, many programs -- including all EM algorithm-based programs -- assume HWE. Algorithms that assume HWE may be sensitive to departures from this assumption. Departures from HWE arise from an excess of either homozygosity or heterozygosity at a locus in a population. Measures evaluating departures from HWE have been shown to correlate with haplotype frequency estimation and assignment inference accuracy [57]. Increases in homozygosity tend to decrease the number of ambiguous individuals (ie individuals whose phase cannot be determined with certainty) and have been shown to have little impact on the accuracy of the EM-based method, as measured by the MSE [49, 64]. By contrast, accuracy decreases with HWE departures resulting from increased heterozygosity. A comparison of the performance of HAPINFREX, EM-DECODER, PHASE v1.0 and HAPLOTYPER in simulated data with varying HWE departures found that all methods showed increased error levels with excess heterozygosity [57]. HAPINFREX was most vulnerable to HWE departures, particularly underperforming in situations with low numbers of homozygotes. Its performance improves rapidly with increasing proportions of homozygotes in a population [57]. In data with a significant proportion of homozygous individuals, HAPINFREX outperformed PHASE v1.0 [57]. In an evaluation of HPLUS on simulated data with HWE departures, accuracy improved with increasing sample size, although little benefit was achieved with samples beyond 100 subjects [55].
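As a minimal sketch of how a departure from HWE, and its direction, might be quantified at a single biallelic locus before haplotyping, the function below computes the one-degree-of-freedom chi-square statistic and the fixation index F from genotype counts. This is a standard textbook calculation, not the specific measure used in the cited comparisons; the function name and return format are assumptions.

```python
import math


def hwe_summary(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) and fixation index F for a biallelic locus."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # F > 0: excess homozygosity; F < 0: excess heterozygosity
    f = 1.0 - n_ab / expected[1] if expected[1] > 0 else 0.0
    # Survival function of a chi-square variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return {"chi2": chi2, "F": f, "p_value": p_value}
```

The sign of F distinguishes the two kinds of departure discussed above: a positive value indicates excess homozygosity, and a negative value indicates excess heterozygosity, the case in which the cited comparisons report increased error. The chi-square approximation is unreliable for small expected counts, where an exact test would be preferable.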

Linkage disequilibrium and recombination: research suggests that recombination hotspots -- that is, chromosomal segments with high levels of recombination -- tend to be separated by extended LD or haplotype 'blocks' exhibiting little recombination and strong LD. This structuring of LD blocks may be common in the human genome [16-18, 65]. Highly variable recombination rates in a small genomic region may violate assumptions of the current coalescent-based programs;[51, 58] however, all methods may have problems constructing haplotypes across regions with high levels of recombination [57, 60] and low LD [36]. While a majority of programs do not make explicit assumptions about LD, the performance of both EM methods [29, 36, 48, 64] and PHASE v1.0 [51] has been shown to improve with increasing LD. Comparisons of the accuracy of PHASE v1.0, HAPLOTYPER and Arlequin v3.0 showed that accuracy was adversely affected by increases in the recombination rate [60]. Doubling of theta (θ) -- that is, the mutation rate per locus -- results in a 5-10 per cent decrease in accuracy for both Arlequin v3.0 and PHASE v1.0. By contrast, the global accuracy of HAPLOTYPER increased with theta in some situations [60]. In this comparison, Arlequin v3.0 demonstrated the highest accuracy in the presence of recombination, by using a sliding windows approach to phase loci. Performance measured by a similarity index for HPLUS declined with increasing number of single nucleotide polymorphisms (SNPs) for a simulated dataset with recombination, although this trend was not observed with MSE [55].

The PL method used by HAPLOTYPER was shown to be insensitive to the presence of recombination hotspots, although extensive recombination may be problematic [57]. Accuracy improves, however, when hotspots are used as the partition sites [54, 57]. PL-EM allows users to specify the partition size, thereby allowing partitioning at the hotspot. Focusing on DNA segments in LD offers a method to overcome the challenges and errors related to haplotyping in the presence of recombination hotspots. Since the recombination hotspots are not known in advance, automating the identification of LD block boundaries and haplotyping within blocks may offer significant benefits [40, 57]. Several programs, notably HAPH, SLHAP v1.0 and PHASE v2.0, have exploited this methodology. SLHAP v1.0 [58] and HAPH have been reported to improve the accuracy of inferred haplotypes. A related approach limits haplotype analysis to segments in LD. The variable-length chains on which HAPLOREC is based allow the program to obtain haplotype fragments of different lengths in different regions, based on the LD strength [62]. A drawback of these methods is that they may lead to a loss of phase information [66]. PHASE v2.0 incorporates a separate algorithm to accommodate recombination, based on the method proposed by Fearnhead and Donnelly [67].

Evaluation of linkage and recombination is an important first step in haplotype analysis. The HAPH and HAPLOVIEW programs identify haplotype blocks in a graphical display. Data that contain recombination hotspots may pose a challenge to haplotyping software that assumes no recombination. Decreases in LD are correlated with increasing estimation error [36] and magnify the effects of genotyping error;[68] thus, although haplotyping with loci whose alleles are in low LD is important, haplotype estimates from such data may be unreliable. Further study in this area is required, particularly for intermediate LD levels: both the influence of LD level on accuracy and the LD level that, if surpassed, improves accuracy remain to be determined. This is not trivial, especially if many loci are considered, each with varying degrees of LD by comparison with the others.
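As an illustration of the pairwise LD evaluation referred to above, the following sketch computes the standard coefficients D, |D'| and r² for two biallelic loci from their four haplotype frequencies. The formulas are the usual textbook definitions; the function name, argument order and return format are assumptions, and the sketch does not reproduce how HAPH or HAPLOVIEW define blocks.

```python
def ld_measures(f_ab_upper, f_a_upper_b_lower, f_a_lower_b_upper, f_ab_lower):
    """Pairwise LD from haplotype frequencies of AB, Ab, aB and ab (in that order)."""
    freqs = (f_ab_upper, f_a_upper_b_lower, f_a_lower_b_upper, f_ab_lower)
    if abs(sum(freqs) - 1.0) > 1e-8:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = f_ab_upper + f_a_upper_b_lower      # frequency of allele A
    p_b = f_ab_upper + f_a_lower_b_upper      # frequency of allele B
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = f_ab_upper - p_a * p_b                # raw disequilibrium coefficient
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d != 0 else 0.0
    r2 = d * d / denom
    return {"D": d, "D_prime": d_prime, "r2": r2}
```

For example, haplotype frequencies of 0.4, 0.1, 0.1 and 0.4 give D = 0.15, |D'| = 0.6 and r² = 0.36, an intermediate level of LD of the kind for which the text notes that further study is needed. In practice the haplotype frequencies supplied to such a function are themselves estimates, so the resulting LD values inherit the estimation error discussed in this section.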

As one would expect, recombination leads to an increase in the number of haplotypes, including low frequency haplotypes that are difficult to estimate accurately [36, 49, 53]. Increasing sample size may improve haplotyping accuracy in the presence of high recombination [39]. Finally, analysing chromosome segments on either side of a recombination hotspot is most likely to be the only current viable option [8].

Population evolutionary history: several programs impose assumptions on the evolutionary history of the populations from which samples have been obtained to improve program efficiency and accuracy and simplify haplotype analysis. The PHASE program is the best-known example of a program that incorporates a population evolutionary history model -- in this case the coalescent model [51, 59]. Moreover, the SLHAP v1.0 [58] and Arlequin v3.0 [60] programs are based on variants of the coalescent model. Several programs exploit the 'perfect phylogeny' concept. These programs (GPPH, DPPH and BPPH) are reported to be fast and accurate and to accommodate large numbers of markers [40, 42, 43, 45]. The HAPH program uses a relaxed model -- imperfect phylogeny -- to make the model more amenable to what is currently known about population evolutionary history [44].

The benefit of incorporating an evolutionary model, such as the coalescent model, is that it takes advantage of similarities between haplotypes; it is thought to result in more accurate haplotypes than other methods [51, 59]. The disadvantage is that the behaviour of alleles in the short-term evolution of chromosomes may violate the model, potentially leading to errors. By contrast, HAPLOTYPER, HAPINFREX and HAPAR impose no population evolutionary history assumptions. Program performance and accuracy may depend on whether the data fit the program's population assumption. To illustrate, Stephens et al. [51] note that PHASE v1.0, by comparison with EM algorithm-based methods, would reduce error rates by 50 per cent when data fit the coalescent model. When compared to PL-EM, using similar data, the improvement in error rate was 26 per cent lower than that shown by Stephens et al. for data that fit the coalescent model [54].

The coalescent model is appropriate for stable populations that have evolved over long periods of time, but is less suitable for populations with past gene flow, stratification and/or population migration. There is disagreement as to whether haplotyping programs based on the coalescent model are the most appropriate for accurate haplotyping [35, 51, 57]. Even when data do not fit the coalescent model, the performance of PHASE v1.0 is suggested to be no worse than that of EM methods [63]. Using simulated data that violate the coalescent model, Niu et al. [57] showed that HAPLOTYPER and EM-DECODER are more accurate than PHASE v1.0 and HAPINFREX. The decline in performance of PHASE v1.0 in at least one of the instances may have been due to insufficient updates rather than model assumptions [59]. The findings of Niu et al. were supported in a subsequent comparison of PHASE v1.0, HAPLOTYPER and Arlequin v3.0 [60]. Arlequin v3.0 had the highest accuracy of the three programs when the coalescent model was violated. In a comparison of PHASE v1.0, HAPINFREX, HAPAR and HAPLOTYPER using data modelled to fit the coalescent model, PHASE v1.0 yielded the lowest error rate, followed by HAPAR [39]. The updated version of PHASE (v2.0) demonstrated improved performance with molecular haplotype data, exceeding the performance of HAPLOTYPER, SLHAP v1.0 and the earlier version of PHASE [59]. An additional study assessed performance of PHASE v1.0, HAPAR and HAPLOTYPER using data simulated to fit the phylogeny model, an evolutionary model related to the coalescent model. The comparison found that PHASE v1.0 had the lowest error rate, followed by HAPAR and HAPLOTYPER. Error rates became similar for the three programs as sample size increased [39]. In summary, programs that assume a population evolutionary history of data should be used with care, since departures from model assumptions may have a significant impact on the accuracy of haplotype assignments and estimates. This should in no way detract from the utility and flexibility of these programs, but serves to illustrate that model assumptions should be considered when these programs are used.

Genotyping error

Genotyping error is a form of misclassification that can reduce the power of association analyses,[69-72] distort LD measurements [69] and lead to erroneous haplotype analysis [60, 68, 73, 74]. The power of SNP association studies decreases with even relatively small genotyping error rates [71]. A similar trend may exist for haplotype association studies, although further examination is required. Sample size requirements for varying SNP error rates and power levels can be examined at the Power for Association with Error (PAWE) website [70, 71] (see Tables 2 and S2).

Most genotyping errors are due to allelic dropout (missing data) and the inability to score heterozygotes, resulting in an increased proportion of homozygotes [73, 75]. Non-random distributions of missing genotypes represent an error in genotype assignments. Programs that deal with missing data often do so by assuming that data are missing at random. Spurious haplotypes may be introduced if loci with genotype errors are included in haplotype analysis [60]. Error rates of 5 per cent may bias haplotype estimates by as much as 30 per cent [72]. Genotyping error leads to a substantial loss in haplotype accuracy, particularly when LD is low and many rare haplotypes exist [74]. Haplotyping methods that favour similar haplotypes may be less sensitive to genotyping error [60]. Recently, Zou and Zhao [72] introduced an EM-based program that corrects haplotype frequency estimates for known genotype error rates, although determining genotyping error can be difficult in unrelated populations [76-78]. A common strategy is to genotype a subset of the study population twice, to determine error rates. Genotyping as few as 25 individuals has been shown to be sufficient for determining genotyping error in a simulation study [76]. Testing assay specificity and HWE deviations of loci are established methods for reducing genotyping error rates [79]. Finally, the accuracy and power of association analyses may be improved by incorporating genotyping uncertainty in haplotype inference to negate the effects of genotyping errors, as in GS-EM [73].
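The duplicate-genotyping strategy mentioned above can be summarised with a simple discordance calculation. The sketch below is illustrative only and is not the estimator used in the cited simulation study. Missing calls are assumed to be coded as None, and the conversion from discordance to a per-call error rate rests on the stated assumption that errors in the two runs are independent and that two erroneous calls never agree.

```python
import math


def duplicate_discordance(first_calls, second_calls):
    """Proportion of duplicate genotype calls that disagree, ignoring missing calls."""
    if len(first_calls) != len(second_calls):
        raise ValueError("both runs must cover the same genotypes")
    compared = 0
    discordant = 0
    for a, b in zip(first_calls, second_calls):
        if a is None or b is None:       # skip pairs with a missing call
            continue
        compared += 1
        if a != b:
            discordant += 1
    if compared == 0:
        raise ValueError("no comparable calls")
    return discordant / compared


def per_call_error_rate(discordance):
    """Per-call error rate e solving (1 - e)^2 = 1 - discordance (assumed model)."""
    return 1.0 - math.sqrt(1.0 - discordance)
```

For instance, calling `duplicate_discordance(["AA", "AB", "BB", None, "AB"], ["AA", "AA", "BB", "AB", "AB"])` compares four pairs of calls and finds one disagreement. Because allelic dropout can produce the same wrong call in both runs, a discordance-based figure of this kind is best read as a lower bound on the true error rate.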

Missing data

Current genotyping methods often result in missing data, owing to a variety of factors, including polymerase chain reaction dropouts, inability to score loci and systematic genotyping technology errors. Missing data reduce the available information, which increases the difficulty and uncertainty of haplotype inference and may bias haplotype assignment. The majority of programs score poorly in this area, as they are unable to accommodate any missing data (see Tables 1 and 4 for programs that accommodate missing data). Some of these programs deal with missing data by ignoring subjects with any missing marker data, leading to a loss of data. Most programs assume that missing data are missing at random (see the section above, on genotyping error).

Accommodating missing data results in a performance decline, with increased memory requirements, longer run times and increased uncertainty. Several strategies have been proposed and implemented for dealing with haplotyping in the presence of missing data. The EM algorithm can be set to accommodate missing data; a discussion focusing on EM haplotyping and missing data is provided elsewhere [80]. Among EM-based programs, LOGINSERM_ESTIHAPLOE includes the option of ignoring individuals with missing data or of using them in haplotype inference, depending on research objectives,[80] whereas PL-EM allows users to specify the number of possible haplotype sets with a probability above a specific level [54]. By contrast, HAPH ignores missing markers in haplotype construction, and uses a maximum likelihood method to infer missing allele(s) to match common haplotypes [44]. The accuracy of HAPH was maintained with up to 10 per cent missing data. Arlequin v3.0 does not try to impute missing data in haplotype analysis, but rather ignores missing loci in the process [60]. This approach is sensitive to the amount of missing data, with small decreases in accuracy with up to 2 per cent missing data becoming more noticeable at 4 per cent. Moreover, the addition of a subset of individuals with large amounts of missing data (20 per cent) has been shown to have a detrimental effect on haplotype analysis on the larger group with complete data [60].
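How an EM algorithm can accommodate missing genotypes may be easier to see in a small sketch. The code below is a generic textbook-style EM for biallelic loci, not the implementation used by any program reviewed here. It assumes genotypes are coded as the number of copies of allele 1 at each locus (0, 1 or 2), with None for a missing call that is treated as missing at random; all function and variable names are hypothetical. A missing locus simply enlarges the set of haplotype pairs compatible with an individual, which is also why run time and memory use grow when missing data are allowed.

```python
from itertools import product

# Ordered allele pairs (haplotype 1, haplotype 2) allowed at one locus,
# keyed by the observed genotype; None means the call is missing.
LOCUS_OPTIONS = {
    0: [(0, 0)],
    1: [(0, 1), (1, 0)],
    2: [(1, 1)],
    None: [(0, 0), (0, 1), (1, 0), (1, 1)],
}


def compatible_pairs(genotype):
    """List every ordered haplotype pair consistent with a multilocus genotype."""
    pairs = []
    for combo in product(*(LOCUS_OPTIONS[g] for g in genotype)):
        hap_1 = tuple(a for a, _ in combo)
        hap_2 = tuple(b for _, b in combo)
        pairs.append((hap_1, hap_2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM, assuming HWE and random missingness."""
    n_loci = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_loci))
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    pair_lists = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        counts = dict.fromkeys(haplotypes, 0.0)
        # E-step: share each individual between its compatible pairs
        # in proportion to the product of current haplotype frequencies.
        for pairs in pair_lists:
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts become the new frequencies.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Hypothetical two-locus sample: six complete genotypes and one
# individual whose first locus failed to genotype.
sample = [(0, 0)] * 3 + [(2, 2)] * 3 + [(None, 0)]
estimates = em_haplotype_frequencies(sample)
for hap, freq in sorted(estimates.items()):
    print(hap, round(freq, 4))
```

Dropping the individual with the missing call instead would correspond to the option of ignoring individuals with missing data described above.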

A limitation of the original version of PHASE (v1.0) was that it could not accommodate missing data [51]. SLHAP v1.0, based on PHASE v1.0's methods, includes modifications that allow accommodation of missing data [58]. The updated version of PHASE (v2.0) was also adapted to accept missing data; phase at unknown positions is randomised and any missing genotypes are imputed with random guesses [59]. The HAPLOREC program also handles missing data by matching haplotypes with missing data to known haplotypes, although missing alleles are not imputed [62]. In addition, the performance of HAPLOTYPER was shown to be stable in the presence of missing data, although caution should be exercised when missing data are included [57]. Excellent discussions of the challenges of haplotyping with missing data are presented elsewhere [57, 81]. The inclusion of individuals with too much missing data (> 10 per cent) may have a detrimental effect on the reconstruction of phase of individuals without missing data. Finally, markers with non-random patterns of genotyping failure should be redesigned or dropped from the haplotyping set [57, 80].

Software characteristics

In this section, issues related to the usability of programs are discussed. User-friendliness is an important consideration in the selection of an appropriate haplotyping program. Relevant issues include computer system requirements, data format, interface, marker characteristics, output, run time and sample size.

Computer system requirements: as detailed in the 'platform' column in Tables 1 and 4, not all programs are available for use with all computer operating systems. The selection of a haplotyping program may necessitate investment in new computer equipment and training. Compiling programs to run on new operating systems poses similar challenges.

Data input format: unfortunately, there is no standard data input format. Nearly all of the programs use a unique data input format. Manipulating data from one format to work with another is cumbersome and difficult. HIT and HAPLOSCOPE are platform programs, incorporating several haplotyping programs in one interface. These programs facilitate comparisons of programs on the same datasets.

User interface: the interface is an important component of usability of a haplotyping program. Selection of a program will depend heavily on current knowledge or ability to invest time in learning about a computer system. The majority of identified programs are command prompt driven (see Tables 1 and 4). These interfaces tend to intimidate computer novices or non-computer scientists. Fortunately, several programs with a graphical user interface were identified, including: Arlequin, HAPLOVIEW, HAPLOSCOPE and HPLUS. Finally, individuals familiar with SAS and S-PLUS may be interested in the SAS Genetics module and HAPLO.STATS programs, respectively.

Marker characteristics: many of the widely-used haplotyping programs are limited to biallelic loci. Programs that accommodate multiallelic markers often experience longer run times. Allele frequency is an important consideration in the selection of markers. Low allele frequencies result in low frequency haplotypes that may have little value in explaining common disease variation [49]. Moreover, low frequency haplotypes, for a variety of reasons (eg sampling error, genotyping error, recombination and low LD), are difficult to estimate accurately [29, 30, 36, 49, 50, 53].

Output: in addition to haplotype frequency estimates and assignments, many programs provide measures for evaluating the 'goodness of fit' of constructed haplotypes. A number of EM-based programs provide posterior probabilities of haplotype assignments, including GENECOUNTING, HPLUS, HAPLO.STATS, LDSUPPORT, MLOCUS, PL-EM and SNPHAP. Posterior probabilities are helpful for evaluation of haplotype assignment and any subsequent analyses. Moreover, the probabilities can be used to weight and evaluate assigned haplotypes and frequency estimates [25, 82]. Determination and interpretation of posterior probabilities is difficult for programs that use pseudo-Gibbs samplers, including Arlequin, HAPLOTYPER and PHASE [51, 57, 60]. Finally, Arlequin, HAPLOH, HPLUS and PL-EM provide the variance estimates for the estimated haplotype frequencies.
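As a worked illustration of what a posterior probability of haplotype assignment means, consider an individual who is heterozygous at two biallelic loci, the simplest phase-ambiguous case. The sketch below assumes Hardy-Weinberg equilibrium and takes the four haplotype frequencies as given; the function name and the frequencies are hypothetical and are not drawn from any reviewed program.

```python
def double_het_posteriors(p_AB, p_Ab, p_aB, p_ab):
    """Posterior probability of each phase for a double heterozygote (Aa, Bb).

    The two possible haplotype pairs are AB/ab and Ab/aB. Under HWE each
    pair has prior probability 2 * (product of its haplotype frequencies),
    so the factor of 2 cancels when normalising.
    """
    cis = p_AB * p_ab
    trans = p_Ab * p_aB
    total = cis + trans
    if total == 0:
        raise ValueError("genotype is impossible under these frequencies")
    return {"AB/ab": cis / total, "Ab/aB": trans / total}


# Hypothetical haplotype frequencies with strong LD between the loci.
posteriors = double_het_posteriors(p_AB=0.4, p_Ab=0.1, p_aB=0.1, p_ab=0.4)
for pair, prob in posteriors.items():
    print(pair, round(prob, 4))
```

When the two posteriors are close to each other the assignment is uncertain, and it is this uncertainty that the probabilities allow subsequent analyses to weight.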

Run time: another issue in assessing the performance of haplotyping programs involves the programs' use of memory and demands on the central processing unit. Run time is also affected by the complexity of the haplotyping problem, which increases with the number of loci [48, 51]. Although the present EM algorithm can theoretically handle an infinite number of polymorphic sites in a sample, it is limited in practice by its exponentially increasing memory requirements [48, 49]. Moreover, EM methods may require multiple restarts to avoid converging to a local rather than the global optimum, increasing the time required to infer haplotypes [48]. Using a Gibbs sampler, PHASE v1.0 determines phase more efficiently than the EM algorithm and constructs haplotypes with a larger number of markers, although run times are lengthy [51, 58]. PHASE has been universally recognised as having several useful features, but a very slow implementation [51, 55, 58, 60]. In the original article describing PHASE v1.0, the program took minutes to hours to run, whereas an EM program and HAPINFREX took seconds [51]. Among Bayesian-based programs, with 50 subjects and 14-119 loci, HAPLOTYPER estimated haplotypes in seconds, Arlequin v3.0 in minutes and PHASE v1.0 in hours [60]. In comparisons of several programs over complete datasets from Reich et al.,[16] HPLUS and HAPLOTYPER completed analysis in under one second, Arlequin v2.0 in less than one minute and PHASE v2.0 in 11 minutes [55].

Additional comparisons suggest that programs that implement modified EM algorithms, such as SNPHAP and PL-EM, had shorter run times than PHASE v1.0 on large datasets. HAPLOREC has similar run times to the modified EM programs [62]. The updated version of PHASE (v2.0) improves program performance, although it was found still to be slower than the other programs [59]. The phylogeny programs (GPPH, DPPH, BPPH and HAPH) have remarkably fast run times [40, 43-45]. HAPH was shown to run faster than both HAPLOTYPER and PHASE v1.0 in a variety of situations [44]. Run times for all programs increased in the presence of missing data and multiallelic markers [54, 60, 62].

Sample size: both sample size and the number of loci are important components for the selection of haplotyping programs. Details on sample size and loci limits are listed in Tables 1 and S1. As sample size increases, both in terms of the number of markers and subjects, the run time increases. The accuracy of EM-based programs has been shown to improve with increasing sample size [4, 53]. Likewise, the accuracy of HAPAR, HAPLOTYPER and PHASE v1.0 was also shown to improve with increasing sample size [39]. Accurate haplotyping of low frequency haplotypes improves with increasing sample size [30].

While standard EM-based programs have no theoretical limit, in practice these programs are limited to fewer than 25 loci, due to memory and processing requirements [48, 49, 51]. HAPINFREX, likewise, has no practical size limits, although the program may fail to start with large numbers of markers [37]. The parsimony program, HAPAR, overcomes HAPINFREX limitations, with accuracy improving with increasing sample size [39]. Programs that accommodate large datasets often sacrifice performance. PL, a divide and conquer strategy, has been proposed as an effective method of dealing with the construction of large haplotypes [57]. This and similar schemes have been implemented in both EM- [54-56] and Bayesian-based programs [57, 59, 60, 62]. These programs are able to handle large datasets, although performance varies (see run time discussion above).
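The practical limit on the number of loci for standard EM programs follows from simple counting, sketched below for biallelic loci. With n loci there are 2^n possible haplotypes, and a fully typed individual who is heterozygous at k loci is compatible with 2^(k-1) unordered haplotype pairs (one pair if k is 0). The function names are hypothetical; the sketch only reports how many states a naive algorithm would have to track and does not model the memory use of any particular program.

```python
def n_possible_haplotypes(n_loci):
    """Number of distinct haplotypes over n biallelic loci."""
    return 2 ** n_loci


def n_compatible_pairs(n_het_loci):
    """Unordered haplotype pairs consistent with a fully typed genotype
    that is heterozygous at n_het_loci loci."""
    return 1 if n_het_loci == 0 else 2 ** (n_het_loci - 1)


for n_loci in (5, 10, 25):
    print(
        f"{n_loci} loci: {n_possible_haplotypes(n_loci)} haplotypes, "
        f"up to {n_compatible_pairs(n_loci)} pairs per individual"
    )
```

Divide and conquer schemes keep these counts manageable by working on short segments of loci and then combining the partial haplotypes.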

Hypothesis testing

Haplotyping in and of itself is usually not the final outcome of interest. The research objective dictates which subsequent analyses are needed. This section will focus on programs that combine haplotyping with hypothesis testing in genetic association studies (see Table 3 and Supplemental Table S3). All haplotype reconstruction methods will encounter a degree of misclassification error or uncertainty in haplotype assignments [7, 81, 83]. If uncertainty of assignments is ignored in subsequent analyses, it can lead to biased parameter estimates and inflated false-positive rates for statistically-based hypothesis tests [25, 31, 82, 83]. In situations where inferred haplotypes had high reliability, biased estimates were avoided and the inferred haplotypes were found to be useful for hypothesis testing [83]. The imperfect phylogeny-based method in HAPH has been shown to assign accurate haplotypes [62] and has recently been updated to include association analysis of discrete and continuous phenotypes, although the potential for bias exists, due to uncertainty of haplotype assignments. Several programs avoid this pitfall by comparing estimated haplotype frequencies between two groups,[84, 85] that is, a case-control model; these include EH, EHPLUS, FASTEHPLUS, GENECOUNTING, PHASE v2.0, the SAS Genetics module and SNPEM. Fallin et al. [10] demonstrated the advantages of this approach using the SNPEM program.

This methodology has been extended to allow adjustment for covariates. The Zaykin [82] program uses a likelihood ratio test statistic for association analysis of haplotypes and phenotypes. HAPLO.STATS [86, 87] and THESIAS [53] also include a test for interaction with covariates using a score and likelihood ratio statistic, respectively. The HPLUS program is limited to qualitative phenotypes, and it provides odds ratio estimates [55, 83]. The THESIAS program has recently been expanded to allow haplotype-based association analysis of survival outcomes [88]. Finally, Arlequin [60, 89] incorporates numerous population genetics tests. Additional discussions on hypothesis testing with haplotypes are available [82, 86, 90-94].

Web-based programs

Several web-based haplotyping programs were identified and are presented in Table 2 and Supplemental Table S2. Web-based versions of haplotyping programs help researchers to circumvent many of the issues related to practical usability, discussed previously. Web-based programs remove the need for the researcher to learn a computer language, purchase computer hardware or software, install and maintain programs or troubleshoot computer problems, thus allowing genetics researchers to focus on what they do best. Moreover, web-based programs usually employ graphical interfaces, allowing the computer layman easily to use a haplotyping program. Additionally, many of the identified web-based programs allow the user to have results sent via e-mail. Finally, additional websites were identified with links to programs, as well as the website for the supplemental tables, also presented in Table 2.

Haplotyping in pooled data

Haplotype analysis using pooled samples is possible, but requires that alleles are in strong LD, that pools are limited to a small number of individuals and that only a few of the possible allele combinations are present [95]. This requires actual genotyping of individuals to determine which haplotypes exist in the population of interest before testing for differences in allele frequencies in the two pooled samples [95, 96]. Three programs for pooled samples were identified, as well as one technique, none of which were web-based (see Table 4 and Supplemental Table S4). All of the programs are only compatible with pools of one to six individuals, in which each pool uniquely comprises cases or controls of unrelated individuals. There has been some discussion as to the number of individuals and SNPs that the pooling technique or algorithm can handle [95-99]. Pools of three to four individuals are optimal, in terms of accuracy and efficiency. Accuracy begins to decline beyond four individuals [12]. Zou and Zhao [72] point out that pooled samples are particularly susceptible to genotyping error and that consideration should be given to the impact of population stratification in pooled samples.


While no single haplotyping program is ideal in all situations, this review found that currently available haplotyping programs should accommodate the research needs of most scientists. While the programs share many similarities, significant differences were observed in their ability to handle various data characteristics and population genetic parameters. Each program had its own unique combination of features and limitations. It is hoped that researchers interested in haplotype analysis will use this paper as a guide for selecting the haplotype analysis program(s) most suitable for their research needs. Moreover, it is anticipated that this review will be an impetus for additional testing, development and improvement of haplotyping software.

The selection of haplotyping programs should be based on the research needs and characteristics of the data to be used for analysis. These criteria include: research objectives, hypothesis testing, data assumptions, genotyping error, missing data and computer expertise to implement programs, if necessary. A suitable haplotyping program is one that generates the desired results (haplotype frequency estimates and/or assignments) and analyses. For hypothesis testing, several programs were identified that combine haplotype analysis with hypothesis testing, which should facilitate analysis. The accuracy of haplotyping programs varied under different assumptions and situations. It was found that deviations from assumptions often resulted in declines in the performance of haplotyping programs; therefore, an important step in selecting a haplotyping program is the evaluation of the assumptions inherent to collection of the data. This should identify programs that can accommodate limitations or departures from assumptions of the data.

Selection of the appropriate haplotyping programs should also take into account the usability of a program. Assessment of this criterion is challenging because usefulness depends on a number of sub-criteria, discussed previously. Web-based programs and those with graphical user interfaces will generally be the easiest to use and have the best usability. Unfortunately, only a short list of programs may suit the needs of researchers. The usability of a program will also depend heavily on the researcher's computer expertise. In summary, the choice of haplotyping program should be based on identifying research needs and selecting a haplotyping program most appropriate to accommodating those requirements. Awareness of program assumptions and limitations should be an important factor in the final decision.

All of the programs reviewed assume genetic homogeneity of individuals in study populations. In brief, the basis of this assumption is that all individuals in a study population share a similar population history. Inclusion of individuals with dissimilar population histories will result in incorrect haplotype estimates due to, for example, LD differences and allele frequency differences between the populations. As an example, consider a hypothetical population of 200 individuals: half being of African-American ancestry and half of European-American ancestry. The resulting haplotyping estimates will not be correct for either the African-American or European-American groups. To obtain accurate haplotype estimates and assignments, the groups must be analysed separately. Further discussions on this topic are available elsewhere [5, 100-103].
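The effect of pooling groups with different population histories can be shown with a short calculation. The sketch below assumes two subpopulations that each have no LD between two biallelic loci but differ in allele frequencies; the frequencies are invented for illustration and do not describe any real population. Pooling them produces an apparent LD coefficient D = m(1 - m)(pA1 - pA2)(pB1 - pB2), where m is the proportion of the sample drawn from the first subpopulation, even though D is zero within each group.

```python
def pooled_ld(mix_1, p_a_1, p_b_1, p_a_2, p_b_2):
    """LD coefficient D between alleles A and B in a pooled sample.

    mix_1 is the fraction of chromosomes from subpopulation 1; the p_*
    arguments are allele frequencies in subpopulations 1 and 2. Each
    subpopulation is assumed to be in linkage equilibrium, so its AB
    haplotype frequency is the product of its allele frequencies.
    """
    mix_2 = 1.0 - mix_1
    p_ab = mix_1 * p_a_1 * p_b_1 + mix_2 * p_a_2 * p_b_2
    p_a = mix_1 * p_a_1 + mix_2 * p_a_2
    p_b = mix_1 * p_b_1 + mix_2 * p_b_2
    return p_ab - p_a * p_b


# Hypothetical allele frequencies in two equally sized groups.
d_pooled = pooled_ld(0.5, p_a_1=0.8, p_b_1=0.7, p_a_2=0.2, p_b_2=0.1)
d_formula = 0.5 * 0.5 * (0.8 - 0.2) * (0.7 - 0.1)
print(round(d_pooled, 4), round(d_formula, 4))
```

Haplotype frequencies estimated from the pooled sample would therefore reflect LD that exists in neither group, which is why the groups should be analysed separately.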

The majority of the reviewed programs are actively maintained and updated regularly. Haplotyping analysis is a rapidly evolving field, with many new methods and programs emerging. Programs that are reviewed here may be modified or even be completely revamped in the near future. Accurate and updated information on existing haplotyping programs will be maintained at An important limitation of this project is that it relied on a review of literature to evaluate the programs. Therefore, it was not possible to validate the accuracy, performance and claims of all individual programs.

This review found that haplotype analysis programs have increased in number and have improved rapidly over the past decade. While existing haplotyping methods may accommodate research needs, many opportunities exist for improvement of haplotyping programs. In particular, improvements are needed in accuracy (particularly for assignments), run time, accommodation of larger sample sets and numbers of loci, handling of missing data, incorporation of association testing and identification and adjustment of haplotype estimates in the presence of genotyping error. In addition, an emerging question is how to construct haplotypes across large genomic regions -- especially with substantial numbers of loci. Available methods include programs that use a block-based approach, methods that build large haplotypes by adding one locus at a time (ie SNPHAP) or programs that use the PL approach (ie HAPLOTYPER, PL-EM). Future studies are necessary to directly evaluate the different measures of accuracy, assess the influence of varying LD levels on accuracy and further assess the impact of departures from assumptions on program performance and accuracy. Ideally, future studies would evaluate several of the more commonly used programs in a standard fashion, allowing comparison across studies. This would facilitate comparison of programs and determination of the most appropriate program. Moreover, adoption of a universal data format would also be helpful. Finally, the use of a standardised phase-known dataset(s), which developers of haplotyping programs could access for evaluating their programs, would assist in the selection, improvement and development of haplotyping programs. Potential sources include examples from the literature [4, 18, 65] and the HapMap project data (available at: