Introduction

Over the course of human history, the spread of humans first across Africa and then to the other continents divided them into more or less discrete populations defined by distinct cultural practices. Periods of separation, selection and population isolation created the conditions for varying degrees of genetic differentiation (e.g. Johansson and Gyllensten 2008). The re-contact and admixture of these more or less discrete groups has resulted in new distinct populations composed of two or more founding groups. Such admixed groups often constitute well-recognized ethnic categories in countries throughout the world.

Historically, recent admixture from populations of differing continental origins defines such groups in the USA and Brazil, for example. Yet the admixture of groups from intra-continental sources has also been a familiar aspect of human population history. For example, the largest human population group, the Han Chinese, have been shown to consist of distinct subpopulations reflecting diverse origins, partial isolation and subsequent admixture (Hu et al. 2007; Chen et al. 2008). Even within national boundaries, European populations are also the product of waves of invasions and subsequent admixture reflected in their current genetic composition. As we demonstrate here, it is straightforward and unambiguous to reconstruct greatly diverged parental populations.

The HLA system, including the loci of the human major histocompatibility complex at 6p21, is the most polymorphic system in the human genome. It evolves rapidly and as such constitutes an excellent marker system for identifying and following the parental population contributions to contemporary admixed groups. Here we describe and apply a method that, given HLA-typed samples of an admixed population and one of its founder populations, identifies the HLA haplotypes of the missing founder population.

Methods

Model for division of samples

Fig. 1 Admixture model of two populations PA and PB combining to produce a new third population, PN, for each haplotype i

If two founder populations PA and PB admix to produce a new admixed population, PN (Fig. 1), then the frequencies of each HLA haplotype i in the three populations are related by:

$$ M\,{\text{PA}}_{i} + (1 - M)\,{\text{PB}}_{i} = {\text{PN}}_{i} $$

where M is the proportion of PA in the new population PN. Given that PA and PN can be sampled and typed, the frequency PBi of each haplotype i in population PB can be expressed as:

$$ {\text{PB}}_{i} = \frac{{\text{PN}}_{i} - M\,{\text{PA}}_{i}}{1 - M} $$
(1)

In this formulation, the admixture rate M is estimated from outside sources. The preferred sources for admixture estimates are genome-wide studies with pure parental population controls (e.g. Price et al. 2007; Risch 2006). The geographical scale of the estimate should also be considered. For example, African American composition has been shown to vary across the United States, with African Americans from the US South having a higher proportion of African ancestry. Large-scale estimates can be obtained by averaging a set of small-scale studies from different locales into a nationwide average, but care should be taken when using estimates based on smaller populations and sampling locations.

While the vast majority of haplotypes are private to different continental populations, some haplotypes are found in both populations, but at differing frequencies. Our method handles both shared and private haplotypes.

Sampling variation may result in negative frequencies when the product of the admixture proportion M and the frequency of the haplotype in population PA exceeds the frequency of the haplotype in the admixed population PN. In these cases, the haplotype is estimated not to exist in PB, so its frequency in PB is set to zero. After the initial determination of all haplotype frequencies PBi, the frequencies are renormalized to sum to one.
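The calculation described above (Eq. 1, followed by zeroing of negative values and renormalization) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' software: the function name `derive_founder_frequencies`, the dictionary-based data layout and the toy frequencies are all assumptions made for the example.

```python
def derive_founder_frequencies(pn, pa, m):
    """Estimate haplotype frequencies of the missing founder population PB.

    pn, pa: dicts mapping haplotype label -> frequency in PN and PA
            (hypothetical data layout chosen for this sketch).
    m:      proportion of PA in PN, taken from an outside source.
    """
    if not 0.0 <= m < 1.0:
        raise ValueError("m must satisfy 0 <= m < 1")
    haplotypes = set(pn) | set(pa)
    # Eq. 1 applied to every haplotype seen in either sample.
    raw = {h: (pn.get(h, 0.0) - m * pa.get(h, 0.0)) / (1.0 - m)
           for h in haplotypes}
    # Negative estimates are treated as "absent in PB" and set to zero.
    clipped = {h: max(f, 0.0) for h, f in raw.items()}
    total = sum(clipped.values())
    if total == 0.0:
        raise ValueError("no haplotype has a positive derived frequency")
    # Renormalize so the derived frequencies sum to one.
    return {h: f / total for h, f in clipped.items()}


# Toy frequencies (not real data), with M = 0.2.
toy_pa = {"01:08:03": 0.6, "02:07:15": 0.4}
toy_pn = {"01:08:03": 0.05, "02:07:15": 0.15, "30:42:03": 0.80}
toy_pb = derive_founder_frequencies(toy_pn, toy_pa, 0.2)
print(sorted(toy_pb.items()))
```

In the toy data the first haplotype is rarer in PN than M times its PA frequency, so its derived PB frequency is set to zero before renormalization.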

Selecting pure ancestral population samples

Once the ancestral population frequencies have been calculated, it is possible to split the list of genotypes from the admixed population sample into separate categories for individuals with pure ancestry and those with mixed ancestry. This has applications in disease association studies for generating sets of case/control populations with the same ancestral background. We have used a Bayesian approach using the ancestral haplotype frequencies to assign an ancestral label to the haplotype of each individual. The weighting parameters are adjusted until the desired admixture proportion is reached. We consider the admixture proportion at the haplotype level to count individuals with mixed ancestry towards the total level of admixture (Eq. 2).

$$ {\text{Total\_PR}} = \sum\limits_{\text{pop1}} \sum\limits_{\text{pop2}} \left( {\text{HapFreq1}}_{\text{pop1}} \times {\text{HapFreq2}}_{\text{pop2}} \right) $$
(2)
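One possible reading of Eq. 2 is that Total_PR is the normalizing constant of a posterior over pairs of ancestral labels for an individual's two haplotypes. The sketch below follows that reading. The function name `ancestry_posterior`, the nested-dictionary layout and, in particular, the way the weighting parameters enter as multiplicative priors are assumptions made for illustration; the text does not specify how the weights are applied.

```python
def ancestry_posterior(hap1, hap2, freqs, weights):
    """Posterior probability of each (pop1, pop2) label pair for one individual.

    freqs:   {population: {haplotype: frequency}} (hypothetical layout)
    weights: {population: prior weight}; assumed here to act as a
             multiplicative prior on each haplotype's ancestral label.
    """
    joint = {}
    for pop1 in freqs:
        for pop2 in freqs:
            joint[(pop1, pop2)] = (
                weights[pop1] * freqs[pop1].get(hap1, 0.0)
                * weights[pop2] * freqs[pop2].get(hap2, 0.0)
            )
    # Sum over all label pairs, analogous to Total_PR in Eq. 2.
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("haplotype pair has zero probability in every population")
    return {labels: p / total for labels, p in joint.items()}


# Toy frequencies (not real data).
toy_freqs = {
    "AFR": {"30:42:03": 0.02, "01:08:03": 0.001},
    "EUR": {"30:42:03": 0.0005, "01:08:03": 0.07},
}
toy_weights = {"AFR": 0.8, "EUR": 0.2}
posterior = ancestry_posterior("30:42:03", "01:08:03", toy_freqs, toy_weights)
print(max(posterior, key=posterior.get))
```

Under this reading, an individual would be classed as having pure ancestry when the most probable label pair names the same population twice, and the weights would be tuned until the labelled sample reproduces the target admixture proportion.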

Estimating admixture from HLA data

We can calculate admixture proportions using HLA data when no estimate is currently available from other sources. With all source populations characterized, one can calculate the relative contribution in an admixed population where admixture proportion is unknown. The solution is to solve for the admixture proportion where the difference between the linear combination of source populations and the admixed population is minimized (Eq. 3).

$$ \mathop {\min }\limits_{\text{Admix}} \sum\limits_{\text{haplo}} \left( {\text{Admixed\_HF}}_{\text{haplo}} - \left( {\text{Source1\_HF}}_{\text{haplo}} \times {\text{Admix}} + {\text{Source2\_HF}}_{\text{haplo}} \times (1 - {\text{Admix}}) \right) \right) $$
(3)
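A minimal sketch of the minimization in Eq. 3 follows. Because each set of haplotype frequencies sums to one, a signed sum of differences would be zero for any admixture value, so this sketch assumes the squared difference is the quantity being minimized; that choice, the function name `estimate_admixture` and the toy data are assumptions, not details given in the text. With squared differences the minimum has a closed form, so no numerical optimizer is needed.

```python
def estimate_admixture(admixed, source1, source2):
    """Least-squares estimate of the proportion of source1 in the admixed sample.

    Minimizes sum_h (N_h - (A_h * m + B_h * (1 - m)))**2 over m, where
    N, A and B are the admixed, source1 and source2 haplotype frequencies.
    Setting the derivative to zero gives
        m = sum((N_h - B_h) * (A_h - B_h)) / sum((A_h - B_h)**2).
    """
    haplotypes = set(admixed) | set(source1) | set(source2)
    numerator = 0.0
    denominator = 0.0
    for h in haplotypes:
        diff_sources = source1.get(h, 0.0) - source2.get(h, 0.0)
        diff_admixed = admixed.get(h, 0.0) - source2.get(h, 0.0)
        numerator += diff_admixed * diff_sources
        denominator += diff_sources * diff_sources
    if denominator == 0.0:
        raise ValueError("source populations are identical; admixture is not identifiable")
    # Keep the estimate inside the valid range for a proportion.
    return min(1.0, max(0.0, numerator / denominator))


# Toy frequencies (not real data): the admixed sample is built as
# 0.2 * source1 + 0.8 * source2, so the estimate should be close to 0.2.
toy_source1 = {"a": 0.7, "b": 0.3}
toy_source2 = {"b": 0.1, "c": 0.9}
toy_admixed = {"a": 0.14, "b": 0.14, "c": 0.72}
print(estimate_admixture(toy_admixed, toy_source1, toy_source2))
```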

In admixed populations with more than two founder populations, such as Caribbean Hispanic populations that have a mixture of African, Native American and European ancestry, the same method can be applied to calculate the frequencies of a single missing founder population when all the other founder populations have been characterized and admixture estimates provided for each component.
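Written out, this extension of Eq. 1 could take the following form, assuming the same linear mixture model holds for K characterized founder populations. The symbols P1, ..., PK for the known founders, M_k for their admixture proportions and PX for the missing founder are notation introduced here for illustration:

$$ {\text{PX}}_{i} = \frac{{\text{PN}}_{i} - \sum\limits_{k = 1}^{K} M_{k}\,{\text{P}}k_{i}}{1 - \sum\limits_{k = 1}^{K} M_{k}} $$

With K = 1 this reduces to Eq. 1. As in the two-population case, negative values would be set to zero and the remaining frequencies renormalized to sum to one.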

Population data

The National Marrow Donor Program in Minneapolis, MN maintains a donor registry that includes African American and European American individuals. A total of 1,000 individuals were randomly selected from each of these two registry groups. We utilized National Marrow Donor Program data recently reported in the literature (Maiers et al. 2007).

HLA typing and haplotype inference

HLA typing was performed at the antigen or two-digit level of resolution at the loci HLA-A, HLA-B and HLA-DRB1 using DNA methods. Three-locus haplotypes are abbreviated by their allele groups; for example, A*32-B*42-DRB1*03 is written 32:42:03.

For haplotype inference, we used standard methods adjusted to accommodate the large haplotype diversity present in the HLA system. We applied the expectation-maximization (EM) algorithm to infer three-locus HLA haplotypes from genotypes. Estimation of the frequencies of rare haplotypes in founder populations is highly prone to error, and there are several possible sources of this error. First, inadequate sampling of populations results in frequencies with wide error bounds, owing to statistical variation when only a small proportion of the overall population is sampled. Second, estimation error is an artifact of the EM algorithm, in which rare haplotypes in the sampled populations are difficult to ascertain because of lack of information. Third, admixture estimation error affects the calculated frequencies of the missing founder population, which depend on the accuracy of the admixture proportion assumed for the admixed population. Finally, some haplotypes may have been created by recombination or mutation after the merging of the two founder populations. This method assigns these haplotypes to the missing founder population.
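For readers unfamiliar with EM-based haplotype inference, the basic idea can be sketched as follows. This is a textbook-style illustration for a small number of loci and not the adjusted method used in this study; the function names, the genotype encoding (one allele pair per locus, alleles as strings) and the toy genotypes are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Return every unordered haplotype pair consistent with an unphased genotype.

    genotype: tuple of (allele, allele) string pairs, one per locus.
    """
    first, rest = genotype[0], genotype[1:]
    pairs = set()
    # The phase of the first locus is fixed; each later locus may be flipped.
    for flips in product((False, True), repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        pairs.add(tuple(sorted((":".join(hap1), ":".join(hap2)))))
    return pairs


def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    candidates = [sorted(compatible_pairs(g)) for g in genotypes]
    haplotypes = {h for pairs in candidates for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    n_chromosomes = 2 * len(genotypes)
    for _ in range(iterations):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: weight each compatible pair by its current probability.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0.0:  # fall back to equal weights if all underflow
                weights, total = [1.0] * len(pairs), float(len(pairs))
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype counts.
        freq = {h: counts[h] / n_chromosomes for h in haplotypes}
    return freq


# Toy two-locus genotypes (not real data): four homozygotes and one
# double heterozygote whose phase is ambiguous.
toy_genotypes = (
    [(("1", "1"), ("1", "1"))] * 2
    + [(("2", "2"), ("2", "2"))] * 2
    + [(("1", "2"), ("1", "2"))]
)
print(sorted(em_haplotype_frequencies(toy_genotypes).items()))
```

In the toy data the homozygous individuals make haplotypes 1:1 and 2:2 common, so the EM iterations resolve the ambiguous double heterozygote in favour of that pair. The example also illustrates why rare haplotypes are hard to ascertain: a haplotype supported only by ambiguous genotypes receives little or no weight.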

Application

The method of derivation of founder population haplotypes can be demonstrated with HLA-typed samples of European Americans and African Americans. African Americans are derived from West Africans and Europeans in the proportions of approximately 80:20 (e.g. Zhu et al. 2005). This example is especially informative for these purposes because of the great (intercontinental) divergence in HLA haplotypes between the peoples of Africa and Europe (Mack and Erlich 2006). Infrequent but genuine haplotype similarities or the possibility of low levels of African admixture in the European American sample (Shriver et al. 2003) will not detract from the utility of this example because of the substantial differences between the two founding populations. In order to estimate the HLA haplotype frequencies of the West African founder populations, we took samples of 1,000 individuals (2,000 haplotypes) typed at the “antigen level” (2-digit) for African American and European American donor samples from the National Marrow Donor Program registries (Maiers et al. 2007).

Estimated three-locus haplotypes were sorted by frequency, and the ten most common haplotypes of each of the two founder populations are displayed in order, alongside their frequencies in the other founder population and in the admixed (African American) population sample (Fig. 2a, b). The most common European American haplotype, A*01-B*08-DRB1*03, present at a frequency of 0.067 in the European American sample, was present in the African Americans at a frequency of only 0.007, or 10.4% of that seen in European Americans. It was entirely absent in the derived West Africans. For the top ten haplotypes, the African American frequencies averaged close to 20% (mean = 20.7%) of the frequencies observed in the European Americans (Fig. 2a). The derived West African A-B-DRB1 haplotypes were in fact not observed in the European Americans at four of the ten possible instances, i.e. detectable at a frequency of ≥0.001.
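The absence of A*01-B*08-DRB1*03 from the derived West Africans can be checked by hand with Eq. 1. The following worked example assumes M = 0.20, the approximate European proportion quoted above, and uses the rounded frequencies reported in the text, so it is an illustration rather than a reproduction of the exact calculation:

$$ {\text{PB}}_{i} = \frac{0.007 - 0.20 \times 0.067}{1 - 0.20} = \frac{-0.0064}{0.80} = -0.008 \;\rightarrow\; 0 $$

Because the result is negative, the haplotype is estimated not to exist in the West African founder population and its frequency is set to zero.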

Fig. 2 a Frequencies of the 10 most common European HLA haplotypes (EUR) plotted against African American (NAFA) and derived West African (AFR) haplotype frequencies. b Frequencies of the 10 most common derived West African HLA haplotypes plotted against African American and European American haplotype frequencies

The ten most frequent West African HLA haplotypes, similarly arranged and compared with the frequencies of the African American and European American haplotypes, are shown in Fig. 2b. The most common West African haplotype is A*30-B*42-DRB1*03, present at a frequency of 0.021. African American frequencies averaged 92% of the estimated West African frequencies. The two European American haplotypes observed in the West African sample were quite rare and may be due to haplotype estimation errors, with frequencies of only 0.00054 and 0.00026 for haplotypes 23:15:11 and 74:15:13, respectively.

Although a fuller description of founder populations estimated from African Americans and other groups will be presented separately, some points are worth making at this time. This example, comparing differences in HLA frequencies between two continental regions, suggests that there may be complete population differentiation in HLA types at the continental level, with little or no sharing of haplotypes. Further underlining this point, the two-digit antigen level of HLA typing resolution presented here often conceals a great deal of additional allelic variation, which can make a sizeable contribution to haplotypic variation. For example, the common alleles seen in Europeans, B*44 and DRB1*15, each consist of dozens of subtypes. A further source of HLA haplotype variation is present in the other histocompatibility loci of the HLA complex. We suggest that samples typed at high resolution and at additional HLA loci would further reduce instances of haplotype overlap between European and African source HLA haplotypes.

Discussion

Historically admixed populations have gained attention in recent years because of their potential for admixture mapping of disease genes (Smith and O'Brien 2005; Patterson et al. 2004; Wang et al. 2008; Xu et al. 2008). Our goal in this contribution is to demonstrate a method, especially pertinent to HLA information, for re-creating the parental populations of an admixed group when one of the parental populations is available. The HLA region of humans is composed of the highly polymorphic major histocompatibility loci distributed over a region of 3–4 Mb. The high diversity of this region is much greater than the sum of the allelic variation from each of the 8–10 histocompatibility loci.

Population samples of HLA frequencies derived by this method can be of value in several respects. First, one or more of the founder populations of a contemporary group will often be unavailable or impossible to sample, making the reconstructed samples of unique value. In addition, a population’s HLA composition is an essential starting place for determining the sampling requirements for an ethnically specific bone marrow or stem cell registry, and in understanding the practical side of population differentiation for patient–donor matching. It is possible to stratify admixed groups based on inference of their HLA haplotypes coming from two ancestral sources or a single population. Patients with mixed ancestral HLA will be among the least likely to find a match because population samples with similar ancestral mixtures may be difficult to obtain.

This work describes a method of reconstructing the haplotype frequencies of a founder population. For this purpose, we use only a relatively small sample of available population data (1,000 European Americans and 1,000 African Americans) and limit the description of haplotypes from the derived population. A more complete and thorough study of founder HLA haplotypes from African American and other admixed populations will be reported separately, and will address relative subtleties of the method such as the adequacy and purity of an available founder population (e.g. European Americans) and present more substantial lists of derived haplotypes. Another issue to be addressed at that time is the apparent sharing of haplotypes from populations of intercontinental origins.

The evolution and modification of haplotypes of the HLA complex have been studied over many years, yet haplotype blocks present throughout the genome also evolve through the same variety of genetic mechanisms as seen in the HLA system. Our method applies not just to the HLA system, but also to other haplotype frequencies in the genome. The HLA system is particularly remarkable for the availability of quality data and the population privacy of its haplotypes. Datasets are often available using other genetic marker systems, raising the possibility of performing this same type of analysis on SNPs or microsatellites. In fact, forays have already been made into analyzing admixed populations with a variety of genetic systems (Bertorelle and Excoffier 1998; Mountain et al. 2002; Choisy et al. 2004; Pfaff et al. 2004; Price et al. 2007). It appears that the HLA system may represent one end of the spectrum of population haplotype divergence in humans.