Background

With the advent of dense genetic maps of single-nucleotide polymorphisms (SNPs), large population samples of diallelic multilocus genotypes are increasingly available for studies in population genetics, marker-disease association, and evolutionary genetics. However, current genotyping methods do not provide information on individual diplotypes (the pair of haplotypes composing a genotype), although this information would substantially increase the power of any genetic analysis. Several methods have been developed for estimating haplotype frequencies from a sample of genotyped but unphased diploid individuals. These include a sequential haplotype inference algorithm [1], several expectation-maximization algorithms [2–4], a coalescent-based algorithm that uses a Markov chain Monte Carlo approach [5], a Bayesian approach that uses a Dirichlet prior distribution for the haplotype frequencies [6], and a recent method based on Bayesian networks that takes account of recombination hotspots, bottlenecks, genetic drift, and mutation [7]. The X chromosome is unique as a population genetics tool because it is diploid in females and haploid in males, a characteristic that, among other things, makes its haplotypes accessible to direct counting in males. The potential of the X chromosome to contribute to fine-scale microevolutionary studies (which are dominated by mtDNA and the Y chromosome) has probably been underused [8].

The present work had two aims. First, we wanted to ascertain whether a haplotype structure could be detected on the X chromosome in a Caucasian population sample, given the genotypes provided by Illumina and Affymetrix for Genetic Analysis Workshop 14. Second, we wanted to test and compare the ability of widely used programs to reconstruct haplotypes, given their distributions in a sample of individuals with known diplotypes. For this purpose, the haplotypes in a sample of unrelated mothers (treated as independent individuals) were determined using the data on their sons, so that the accuracy of different methods of inferring haplotypes from genotype data could be evaluated by comparing the true and the inferred distributions of haplotype frequencies.
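The use of sons to phase their mothers can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that genotypes are coded as the number of copies of one allele (0, 1 or 2) at each marker, that the son's hemizygous X genotype is coded as 0/1, that there are no missing data or genotyping errors, and that no crossover occurred within the window of markers in the transmitting meiosis. The function name `maternal_diplotype` is hypothetical.

```python
def maternal_diplotype(mother_genotype, son_haplotype):
    """Split a mother's unphased genotype into two haplotypes using her son.

    mother_genotype: sequence of allele counts (0, 1 or 2), one per marker.
    son_haplotype:   sequence of alleles (0 or 1) on the son's single X.
    Returns (transmitted, untransmitted) haplotypes as tuples.
    Assumption: no crossover inside the window, no typing errors.
    """
    if len(mother_genotype) != len(son_haplotype):
        raise ValueError("mother and son must be typed at the same markers")
    untransmitted = []
    for count, allele in zip(mother_genotype, son_haplotype):
        other = count - allele  # allele left on the mother's other chromosome
        if other not in (0, 1):
            raise ValueError("son's allele is incompatible with the mother's genotype")
        untransmitted.append(other)
    return tuple(son_haplotype), tuple(untransmitted)


# Example: a mother heterozygous at markers 1 and 4.
transmitted, untransmitted = maternal_diplotype([1, 2, 0, 1], [1, 1, 0, 0])
```

Incompatible pairs raise an error rather than being silently phased, since they would indicate a typing error or a mis-specified relationship.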

Methods

All possible unrelated mother-son pairs of Caucasian ancestry were selected from the 143 families of the Collaborative Study on the Genetics of Alcoholism pedigree files. The diplotypes of the mothers were inferred from the X chromosomes of their sons. An integrated map of the X chromosome for the Illumina and Affymetrix SNP datasets was obtained by querying the NCBI Human Genome (Build 34) for marker position. Markers with heterozygosity <0.2 were first removed from the Affymetrix dataset in order to render the 2 datasets more homogeneous. The final map included 313 markers (121 Illumina, 192 Affymetrix). It spanned 146.5 Mb, with an average intermarker spacing of 0.47 Mb. A large gap (7.8 Mb) was located at 56.5 Mb. Linkage disequilibrium (LD) between pairs of markers was measured by the parameter D'. For multilocus disequilibrium, we define here an index called D*, computed as D* = 1 - (Hd - Hmin)/(Heq - Hmin). The haplotype diversity Hd is computed as Hd = 1 - ∑ p_i^2 (p_i being the frequency of haplotype i), analogous to the gene diversity of a single locus. The expected haplotype diversity under no LD, Heq, is calculated as Heq = 1 - ∑ E{p_i}^2 (E{p_i} being the expected frequency of each possible haplotype, i.e., the product of the frequencies of the alleles composing that haplotype). The minimum possible value of haplotype diversity, Hmin, is obtained computationally. Specifically, if n haplotypes typed for s SNPs are arranged in an n by s matrix and the alleles are coded consistently (e.g., 0 = low-frequency allele at all loci), Hmin is obtained by sorting each column of the matrix independently of the others and then computing Hd on the rearranged matrix with the above equation. Another measure of multilocus disequilibrium (the normalized entropy difference, ε) was published previously [9]. We applied both D* and ε to all possible sliding windows of 5 SNPs each.
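The definition of D* given above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' program: haplotypes are assumed to be complete (no missing alleles) sequences of 0/1 codes, and all function names are hypothetical. The sum over all possible haplotypes in Heq is evaluated through the identity ∑ E{p_i}^2 = ∏_j (q_j^2 + (1 - q_j)^2), where q_j is the frequency of allele 1 at marker j, which avoids enumerating the 2^s haplotypes.

```python
import math
from collections import Counter


def haplotype_diversity(haplotypes):
    """Hd = 1 - sum of squared frequencies of the observed haplotypes."""
    n = len(haplotypes)
    counts = Counter(tuple(h) for h in haplotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def expected_diversity(haplotypes):
    """Heq = 1 - sum of squared expected frequencies under no LD."""
    n = len(haplotypes)
    homozygosity = 1.0
    for column in zip(*haplotypes):
        q = sum(column) / n  # frequency of allele 1 at this marker
        homozygosity *= q * q + (1.0 - q) * (1.0 - q)
    return 1.0 - homozygosity


def minimum_diversity(haplotypes):
    """Hmin: Hd of the matrix after sorting every column independently."""
    sorted_columns = [sorted(column) for column in zip(*haplotypes)]
    return haplotype_diversity(list(zip(*sorted_columns)))


def d_star(haplotypes):
    """D* = 1 - (Hd - Hmin) / (Heq - Hmin); NaN when Heq equals Hmin."""
    hd = haplotype_diversity(haplotypes)
    heq = expected_diversity(haplotypes)
    hmin = minimum_diversity(haplotypes)
    if math.isclose(heq, hmin):
        return float("nan")  # index undefined, e.g. all markers monomorphic
    return 1.0 - (hd - hmin) / (heq - hmin)


def sliding_d_star(haplotypes, width=5):
    """D* for every window of `width` consecutive markers."""
    n_markers = len(haplotypes[0])
    return [d_star([h[start:start + width] for h in haplotypes])
            for start in range(n_markers - width + 1)]
```

With two markers, the sample 00, 00, 11, 11 has Hd = Hmin and therefore D* = 1, whereas the sample 00, 01, 10, 11 has Hd = Heq and therefore D* = 0.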

The following programs were evaluated for their accuracy in inferring population haplotype frequencies: 1) ARLEQUIN 2.001; 2) PHASE 2.1.1; 3) SNPHAP 1.1; 4) HAPLOBLOCK 1.2; 5) HAPLOTYPER 1.0. In inferring haplotypes, ARLEQUIN and SNPHAP use an expectation-maximization algorithm, HAPLOTYPER uses a Bayesian approach assuming a prior Dirichlet distribution of haplotype frequencies, PHASE uses a coalescent-based prior distribution of haplotype frequencies coupled with a Markov chain Monte Carlo approach to approximate the posterior distribution, and HAPLOBLOCK uses a Bayesian network method. ARLEQUIN and SNPHAP ignore missing data, whereas PHASE and HAPLOTYPER make informed guesses; in HAPLOBLOCK, users can choose between these 2 options. Of all the programs, only PHASE includes the possibility of specifying a genetic map and modeling the process of recombination. Genotypes at 5 or 10 consecutive markers were selected from the Illumina dataset based on varying levels of D*, and were submitted to all programs. The accuracy of each program was measured using the Pearson correlation coefficient between the true and the inferred haplotype frequencies.
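The accuracy measure can be sketched as follows. The text does not state over which set of haplotypes the correlation was computed; this sketch assumes the union of the haplotypes present in either distribution, with a frequency of 0 assigned to a haplotype absent from one of them. The function names and the dictionary representation are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


def frequency_correlation(true_freq, inferred_freq):
    """Correlate true and inferred haplotype frequencies.

    Both arguments map a haplotype label (e.g. "01101") to its frequency.
    Assumption: haplotypes missing from one distribution get frequency 0.
    """
    labels = sorted(set(true_freq) | set(inferred_freq))
    x = [true_freq.get(label, 0.0) for label in labels]
    y = [inferred_freq.get(label, 0.0) for label in labels]
    return pearson_r(x, y)


# Example with made-up frequencies for three 5-marker haplotypes.
true_freq = {"00000": 0.5, "11111": 0.3, "01010": 0.2}
inferred_freq = {"00000": 0.45, "11111": 0.35, "01011": 0.2}
r = frequency_correlation(true_freq, inferred_freq)
```

Under this convention a program is penalized both for missing a true haplotype and for inferring one that is not present in the sample.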

Results

The final dataset analyzed in the present work consisted of 104 unrelated Caucasian females with known diplotypes at 313 SNPs on the X chromosome. Figure 1 shows the parameter D' between all adjacent markers and between each marker and the fifth marker downstream; 84% of marker pairs closer than 100 kb showed high levels of LD, with p-values < 0.01, in comparison with 34% of the pairs 100 to 500 kb apart, and 4.4% of the pairs 500 kb to 2 Mb apart. Two marker pairs with an intermarker distance >3 Mb showed highly significant LD. The multilocus haplotype structure of the X chromosome was then investigated by considering sliding windows of 5 markers and calculating both D* and ε. The 2 measures were highly correlated (r = 0.945). Because of the uneven marker distribution in the maps, the length of the 5-marker segments was highly variable, from 93 kb to 7.94 Mb; in the present analysis, segments longer than 5 Mb were not considered. Several regions with high values of D* (low haplotype diversity) were separated by segments with similar values of Hd and Heq (no LD, Figure 2). One instance of D* = 1.0 (i.e., in which Hd = Hmin) was located at about 56 Mb, near the large gap in the chromosome map. These 5 markers were part of a chromosome segment of 10 markers spanning 1.33 Mb for which only 7 haplotypes, out of 1,024 theoretically possible, were observed. The value of D* for this segment of 10 markers was 0.74. This is consistent with previous reports of a substantial decrease in recombination around the centromere of the X chromosome [10].
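For reference, pairwise D' between two diallelic markers can be computed from phased haplotypes as in the sketch below. It assumes Lewontin's standard normalization and reports the absolute value |D'|. The text does not specify the sign convention or the test used for the p-values, so the latter is not reproduced here, and the function name `d_prime` is hypothetical.

```python
import math


def d_prime(haplotypes, i, j):
    """|D'| between markers i and j from phased 0/1 haplotypes.

    D = p_AB - p_A * p_B, normalized by its maximum attainable magnitude
    given the allele frequencies (Lewontin's D'). Returns NaN when either
    marker is monomorphic, because D' is then undefined.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return math.nan
    return abs(d) / d_max


def adjacent_d_prime(haplotypes, step=1):
    """|D'| between each marker and the marker `step` positions downstream."""
    n_markers = len(haplotypes[0])
    return [d_prime(haplotypes, k, k + step) for k in range(n_markers - step)]
```

Calling `adjacent_d_prime(haplotypes, step=1)` corresponds to the adjacent-marker comparison, and `step=5` to the comparison of each marker with the fifth marker downstream.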

Figure 1

Standardized linkage disequilibrium as a function of intermarker distance. Standardized linkage disequilibrium (D') between markers of the X chromosome as a function of the intermarker distance. Large symbols: D' values with nominal p ≤ 0.01 (blue: adjacent markers; red: LD computed at 5-marker intervals). Dots: D' values with p > 0.01. Marker pairs with distance <1 kb have been omitted.

Figure 2

Multilocus LD of the X chromosome. Bars represent sliding windows of 5 markers each, whose D* value is plotted. The line under the chart shows the marker location; a large gap centered at 60 Mb may be noted.

Fourteen series of unphased genotypes with different values of D* (10 series consisting of 5 consecutive markers and 4 of 10 markers) were submitted to each of the 5 programs. Table 1 shows Pearson correlation coefficients between the observed and the inferred haplotype frequencies. For 5-marker haplotypes, the correlation coefficients were high for all programs, even in situations of moderate LD. In the case of 10-marker haplotypes (last 4 rows in Table 1), all the programs reconstructed the true haplotype distribution perfectly when the number of different haplotypes in the sample was small in comparison with the total number of possible haplotypes. With the increase of haplotype diversity (series 11 in Table 1), the performance of the programs started to decrease and to differ, though the correlation between the true and the estimated haplotype frequencies was still high; PHASE achieved the best performance (r = 0.996). When the haplotype diversity was high (130 different haplotypes in a sample of 104 individuals), performance was generally poor; only PHASE achieved a high correlation coefficient (0.737). When the majority of the haplotypes are unique (last row in Table 1), the inferred haplotypes are clearly unreliable.

Table 1 Correlation between the true haplotype frequencies and those estimated by five programs

Discussion

We investigated the large-scale haplotypic structure of the X chromosome in a Caucasian population sample by computing D' between all adjacent markers and between each marker and the fifth marker downstream; high levels of LD were detected even at distances > 1 Mb. We then applied a measure of multilocus LD, here called D*, to all possible segments of 5 consecutive markers. This parameter is easily computed and is based on the standard definition of heterozygosity; D* reaches its maximum possible value of 1.0 when the haplotype diversity is at a minimum, i.e., when LD is complete. Thus, D* appears to be a suitable measure for studies of large-scale multilocus linkage disequilibrium. In addition, we wanted to test the ability of widely used programs to reconstruct the haplotypes of population samples. All investigated programs perform well when the number of markers is small (5), even at low values of D*. With a higher number of markers (10), high correlation values between true and inferred haplotype frequencies are attained only in conditions of high D*. PHASE is an exception, in that it reconstructed the true distribution of haplotype frequencies with good accuracy even in a difficult situation. PHASE required substantially more computing time than the other programs (10–20 minutes in comparison with less than a second on the same machine), with the exception of HAPLOBLOCK, which ran for more than 30 hours.

Conclusion

The SNP haplotypic structure of the X chromosome is complex, with regions of high haplotype conservation (most notably, around the centromere) interspersed among regions of higher haplotype diversity. A more detailed definition of this structure, to be accomplished in further studies, could be useful in evolutionary analyses and in disease association studies.

All the tested programs were accurate (r = 1) in reconstructing the true distribution of haplotype frequencies when LD was high. Only the program PHASE achieved a high correlation coefficient (r > 0.7) when linkage disequilibrium was low.