Introduction

Genetic analysis of phenotypes and diseases has traditionally followed two approaches: family-based linkage analysis and population-based association studies. While in linkage analysis it is the co-segregation of alleles in families that is measured, population-based studies use non-random associations between phenotypes and alleles in populations to identify causative genes. Linkage analysis has proven to be immensely successful as a means of identifying genes for a number of single gene diseases with simple Mendelian inheritance (eg see OMIM database). Complex diseases are multifactorial, polygenic and often characterised by late age of onset, incomplete penetrance, locus heterogeneity and environmental exposures and, despite significant efforts, have not been amenable to family-based mapping.

Linkage disequilibrium (LD) is an important aspect of genetic association studies and is generated in a population through mutation, selection, drift, non-random mating and admixture [1]. Allelic associations due to LD are significant and are correlated with physical distance within small genomic regions but decay over time due to recombination [2-4]. LD-based association studies have been successful in both fine scale mapping [5, 6] and initial disease gene mapping in homogeneous populations that have undergone recent bottlenecks (eg Hirschsprung disease in Mennonites [7], Bardet-Biedl syndrome in Bedouins [8]). Allelic associations can result either from direct functional effects of the alleles tested or indirectly through non-random associations between the allele measured and nearby functional alleles. Since functional alleles in most genes are still unknown and are indeed an object of the research, LD is an important feature of how genes can be screened for alleles that alter disease risk. Thus, there has been substantial focus on the extent of LD across the genome and the definition of statistical methods for disease gene mapping using LD [9-11]. In large cosmopolitan populations, however, LD may be difficult to detect when the mutation is old, since the amount of remaining LD may be small. Additionally, false-positive associations due to population stratification are important confounders in LD-based association studies.

Admixture studies and their use in disease gene mapping

Intermixture between previously isolated populations leads to the creation of admixed populations. The process of admixture itself creates LD between all loci, linked and unlinked, that have different allele frequencies in the parental populations. The magnitude of admixture linkage disequilibrium (ALD) in an admixed population depends on the allele frequency differential between the parental populations, the level of admixture, the admixture dynamics, the time since admixture and the recombination rate between the loci [12]. While ALD between unlinked markers decays rapidly (within two to four generations), ALD between linked markers decays more slowly. Because ALD decreases exponentially with genetic distance, the high ALD between closely linked markers can be distinguished from the ALD generated at unlinked loci. Thus, if the parental populations differ in a trait or disease due to different frequencies of risk alleles, it should be possible to identify the loci containing these alleles using admixture mapping (AM) [12-14].
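The dependence of ALD on the admixture proportion, the allele frequency differentials and the recombination rate can be illustrated with a short calculation. The sketch below assumes the simplest case only: a single admixture event, no LD within either parental population, and the standard two-population expressions D0 = m(1 - m)δAδB and Dt = D0(1 - θ)^t. The function names and the numerical values are illustrative choices, not taken from the studies cited above.

```python
def initial_ald(m, delta_a, delta_b):
    """LD created by a single admixture event between two parental
    populations (assumes no LD within either parental population)."""
    return m * (1.0 - m) * delta_a * delta_b


def ald_after(d0, theta, generations):
    """LD remaining after a number of generations of random mating,
    with recombination fraction theta between the two loci."""
    return d0 * (1.0 - theta) ** generations


# Illustrative values: 50 per cent admixture, delta = 0.54 and 0.49.
d0 = initial_ald(0.5, 0.54, 0.49)

for t in (0, 2, 4, 10):
    unlinked = ald_after(d0, 0.5, t)   # theta = 0.5: unlinked loci
    linked = ald_after(d0, 0.05, t)    # theta = 0.05: roughly 5 cM apart
    print(f"t={t:2d}  unlinked D={unlinked:.5f}  linked D={linked:.5f}")
```

Under these assumptions, ALD between unlinked loci halves in every generation, whereas ALD between loci about 5 cM apart loses only about 5 per cent per generation.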

Many US residents can trace their genetic ancestry to more than one continent. The European colonial period that started in the late 1400s brought together in the New World populations that had been geographically isolated, namely, Europeans, West Africans and Native Americans. Given the recent and common origin of all human populations, this admixture had only a small average effect on the gene pools of these new populations. In other words, for most genomic regions, the pre-colonial (or parental) populations had similar allele frequencies and, at these, admixture was of little consequence. At some other loci, however, there had been some change in allele frequency in the time since the separation of parental populations and it is at these loci where admixture has had an important effect. Since populations like African Americans, African Caribbeans and Mexican Americans were formed in the recent past, allelic associations in these populations that were created by admixture extend over large distances. Admixed populations represent a useful resource for mapping complex-disease genes by using this long-range ALD [12], which requires fewer markers to screen the genome than other populations or approaches. Understanding the genetic consequences of admixture is important because it can be both a confounding factor and a source of statistical power in gene identification studies.

Two models of admixture dynamics have been described to represent the extremes of the process by which an admixed population is formed: the continuous gene flow (CGF) model and the hybrid isolation (HI) model [15, 16]. In the HI model, admixture occurs immediately in a single generation without further contribution from either parental population, hence, ALD is generated in a single generation and gradually decays in successive generations through independent assortment and recombination between loci. Few false-positive results are thus expected in an association study under the HI model. Alternatively, the CGF model represents a situation where admixture occurs at a steady rate in each generation, with contributions from one (or all) of the parental populations into the admixed population. ALD under the CGF model increases in each generation, since new admixture is constantly occurring. A point will be reached, however (when the admixture proportion = 0.5), where continued admixture will actually decrease the ALD, since added gene flow will result in the conversion of the admixed population into the introgressing parental population. Figure 1 shows the amount of ALD expected under these two models for linked and unlinked loci. For both models, association between markers is inversely correlated with the genetic distance between them. Simulation studies have shown that populations that have a demographic history more consistent with the CGF model of admixture retain ALD over larger chromosomal regions and show significant associations between unlinked marker loci [15]. While associations between unlinked markers could potentially lead to false-positives, conditioning upon parental admixture allows the distinction between associations arising due to true linkage and those due to CGF stratification to be made, thereby providing greater power for detecting ALD over larger chromosomal distances [15].

Figure 1

The amount of admixture linkage disequilibrium (ALD) expected under the continuous gene flow (CGF) and hybrid isolation (HI) models of admixture for unlinked loci and loci linked at 5 cM. The results shown are for two loci with δ = 0.54 and 0.49, and with 50 per cent admixture in the first generation for the HI model and 1.9 per cent admixture for 36 generations under the CGF model (equivalent to 50 per cent total). ALD under the HI model decreases for both linked and unlinked loci, whereas ALD under the CGF model for both linked and unlinked loci increases initially and then decreases (adapted from Pfaff et al., 2001 [15])
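The qualitative behaviour described for the two models can be reproduced with a deterministic recursion. The sketch below is not the simulation procedure of Pfaff et al. [15]; it assumes an infinitely large, randomly mating population that starts as parental population 1 and receives a fraction of its gametes from parental population 2 in each generation, with recombination applied before gene flow. The parental allele frequencies are hypothetical values chosen only to give δ = 0.54 and 0.49, and the function name is illustrative.

```python
def ald_trajectory(gene_flow, theta, parental_1, parental_2):
    """Return LD (D) in the admixed population after each generation.

    gene_flow  -- fraction of gametes contributed by population 2 per generation
    theta      -- recombination fraction between the two loci
    parental_1 -- (freq at locus A, freq at locus B) in population 1
    parental_2 -- the same frequencies in population 2 (the donor)
    """
    p_a, p_b = parental_1
    q_a, q_b = parental_2
    d = 0.0
    trajectory = []
    for alpha in gene_flow:
        # Recombination erodes the LD carried over from the last generation.
        d *= 1.0 - theta
        # Mixing with the donor population creates new LD in proportion to the
        # current frequency differences at both loci.
        d = (1.0 - alpha) * d + alpha * (1.0 - alpha) * (p_a - q_a) * (p_b - q_b)
        # Allele frequencies move towards those of the donor population.
        p_a = (1.0 - alpha) * p_a + alpha * q_a
        p_b = (1.0 - alpha) * p_b + alpha * q_b
        trajectory.append(d)
    return trajectory


GENERATIONS = 36
POP_1 = (0.77, 0.745)   # hypothetical frequencies in parental population 1
POP_2 = (0.23, 0.255)   # hypothetical frequencies in parental population 2

HI_FLOW = [0.5] + [0.0] * (GENERATIONS - 1)   # hybrid isolation
CGF_FLOW = [0.019] * GENERATIONS              # continuous gene flow

hi_unlinked = ald_trajectory(HI_FLOW, 0.5, POP_1, POP_2)
hi_linked = ald_trajectory(HI_FLOW, 0.05, POP_1, POP_2)
cgf_unlinked = ald_trajectory(CGF_FLOW, 0.5, POP_1, POP_2)
cgf_linked = ald_trajectory(CGF_FLOW, 0.05, POP_1, POP_2)

# Cumulative contribution of population 2 under continuous gene flow.
cgf_total = 1.0 - (1.0 - 0.019) ** GENERATIONS

for g in (1, 2, 4, 10, 20, 36):
    print(f"gen {g:2d}  HI linked {hi_linked[g - 1]:.4f}  "
          f"CGF linked {cgf_linked[g - 1]:.4f}  "
          f"CGF unlinked {cgf_unlinked[g - 1]:.4f}")
```

In this simplified recursion, ALD under the HI model falls in every generation, while ALD under the CGF model first rises and later declines as the admixed population converges on the donor population.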

There are several ways in which admixture can be an important resource in the elucidation of genetic factors that contribute to the risk of common disease. Common diseases often have environmental components to their risk, and the clinical phenotype results from currently unknown interactions between environmental factors and underlying genotypes. Decomposing the sources of variation is thus important in order to understand the aetiology of the trait accurately. It is possible to distinguish between the genetic and environmental explanations for ethnic differences in disease risk (and to investigate the mode of inheritance) by studying the relationship of disease risk to individual admixture [14, 17-19]. For example, recent studies have demonstrated a strong relationship between proportional West African ancestry and the risk of systemic lupus erythematosus in admixed populations in Trinidad [18]. Several common diseases (eg hypertension, diabetes, obesity, prostate cancer and osteoporosis) have differences in risk among population groups (see Table 1). In situations where these differences have a genetic basis, genes underlying these differences can be identified by testing for locus ancestry by conditioning on parental admixture. As detailed by Shriver et al., this approach has a greater statistical power than family linkage studies for mapping polygenic traits [14]. Estimates of biogeographical ancestry (BGA), the proportional ancestry levels of an individual, can be used in conjunction with measured environmental effects for investigating the roles of environmental and inherited risks underlying complex traits [18-20]. It is important to recognise that associations between individual admixture and disease risk might reflect correlations between BGA and socio-cultural variables and exposures. For example, hypothetically, if BGA and years of education were to be correlated, hypertension might be correlated with BGA, even though the causal risk factor was years of education or vice versa.

Table 1 Diseases with possible genetic components based on ethnic differences in disease rates and hence amenable to admixture mapping

Marker choice for admixture mapping

Admixture-based methods rely on using suitable markers and estimates of allele frequencies from appropriately identified parental populations. Since ALD is of recent origin and extends over larger distances, fewer markers are required for AM studies. Markers informative for ancestry have been used in several contexts and have been referred to as 'ideal' [21], 'private' [22] and 'unique' [23]. Informativeness of such markers can be measured as the allele frequency differential (δ), which is the absolute value of the difference in the frequency of a particular allele between populations [12, 24]. Microsatellites and insertion/deletion polymorphisms with δ > 0.3 were recently called 'ethnic-difference markers' (EDMs) [25] suitable for mapping by admixture linkage disequilibrium (MALD). Additionally, markers with high δ and very high log likelihood allelic ratio (LLAR) between populations have been designated 'population specific alleles' (PSAs) [26]. This report followed from earlier work where markers with large allele frequency difference were identified to be appropriate for admixture studies [27, 28], and most (> 95 per cent) of the arbitrarily identified biallelic markers had δ < 50 per cent [24]. Thus, the authors proposed that ideal PSAs should have δ > 50 per cent and also indicated that for multiallelic loci, a composite δ could be estimated as one half the summation of the absolute value of allelic frequency differences for all alleles at that locus [26]. It has also been shown that markers with lower δ values, of approximately 30 per cent, can provide up to 80 per cent power for detecting associations at distances of 5 cM with a large enough sample size (N = 1,000) [15].
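The two definitions given above, δ for a single allele and the composite δ for a multiallelic locus, translate directly into code. The sketch below is a minimal illustration; the function names and the allele frequencies are hypothetical and do not come from any of the marker panels cited.

```python
def delta(freq_pop1, freq_pop2):
    """Allele frequency differential for one allele between two populations."""
    return abs(freq_pop1 - freq_pop2)


def composite_delta(freqs_pop1, freqs_pop2):
    """Composite delta for a multiallelic locus: one half of the sum of the
    absolute allele frequency differences over all alleles at the locus.
    Alleles missing from one population are treated as having frequency 0."""
    alleles = set(freqs_pop1) | set(freqs_pop2)
    return 0.5 * sum(
        abs(freqs_pop1.get(a, 0.0) - freqs_pop2.get(a, 0.0)) for a in alleles
    )


# Hypothetical biallelic marker.
biallelic = delta(0.9, 0.2)

# Hypothetical three-allele microsatellite.
pop1 = {"A1": 0.6, "A2": 0.3, "A3": 0.1}
pop2 = {"A1": 0.1, "A2": 0.3, "A3": 0.6}
multiallelic = composite_delta(pop1, pop2)

print(f"delta = {biallelic:.2f}, composite delta = {multiallelic:.2f}")
```

For a biallelic marker the composite δ reduces to the ordinary δ, because the two alleles contribute the same absolute difference and the factor of one half removes the double counting.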

Pfaff et al. [15] suggested referring to markers suitable for admixture studies as 'ancestry informative markers' (AIMs), given that the central feature of these markers is the ancestry information content (f) [29]. The present authors agree that the term AIM more accurately describes these markers and does so using language that is less likely to be misunderstood and misinterpreted [14, 17, 28]. Marker information content 'f' denotes the locus-specific Fst and is a value representative of the differentiation between two populations at a single locus. This is equivalent to Wahlund's standardised variance for allele frequency. Simulation studies for estimating the information content of markers with varying levels of f have shown that for 1,000 markers with average information content for ancestry at 40 per cent between two ancestral subpopulations, approximately 80 per cent of the information about ancestry can be extracted from an initial genome screen [13, 29]. After initial identification of regions showing admixture, more markers can be typed in these regions to increase extraction of information to nearly 100 per cent.
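As a rough illustration of how f relates to δ, the sketch below computes Wahlund's standardised variance for a biallelic marker. It assumes two parental populations given equal weight, so that the variance of the allele frequency is δ²/4 and f = δ²/(4p(1 - p)), where p is the mean allele frequency. The estimator used in the cited work may differ in detail, and the function name and frequencies are hypothetical.

```python
def locus_fst(freq_pop1, freq_pop2):
    """Wahlund's standardised variance of allele frequency for a biallelic
    marker, assuming two equally weighted populations."""
    mean_freq = 0.5 * (freq_pop1 + freq_pop2)
    if mean_freq in (0.0, 1.0):
        return 0.0  # monomorphic in both populations: no information
    variance = 0.25 * (freq_pop1 - freq_pop2) ** 2
    return variance / (mean_freq * (1.0 - mean_freq))


# Hypothetical marker with delta = 0.7.
f_high = locus_fst(0.9, 0.2)
# Hypothetical marker with no frequency difference.
f_none = locus_fst(0.5, 0.5)
# Fixed difference between the populations.
f_fixed = locus_fst(1.0, 0.0)

print(f"f = {f_high:.3f}, {f_none:.3f}, {f_fixed:.3f}")
```

Under this definition f ranges from 0, for a marker with identical frequencies in both populations, to 1, for a marker with a fixed difference.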

It is well established, however, that only 5-15 per cent of the total genetic variation results from differences among human populations [30-32]. Moreover, most alleles are shared between populations, and alleles common in one population are also common in other populations. Thus, most genetic markers are unaffected by admixture and it is imperative to choose markers that show high levels of δ (and f) between the parental populations. Recent studies by several groups have focused on identifying panels of markers suitable for admixture studies. One notable study screened 744 microsatellite markers for composite δ values and LLAR in four different populations and identified a genome spanning set of 315 markers (average spacing 10 cM, δ ≥ 0.3) for mapping in African Americans and 214 markers (average spacing of 16 cM, δ ≥ 0.25) for mapping in Hispanics [33]. A DNA pooling method was used to identify 151 AIMs (microsatellites and short insertion/deletion polymorphisms), with δ > 0.3 for mapping in Mexican American populations to distinguish between European-American and Native-American contributions [25]. Ninety-seven AIMs were identified for mapping in African-American populations [25] that show limited variation within Africa [34]. The authors' group has reported AIMs over the past few years [14, 17, 26, 35, 36]. Additional resources are available for obtaining marker frequency, and genotype and haplotype information, from The SNP Consortium (TSC; http://snp.cshl.org), the National Center for Biotechnology Information's 'dbSNP' website (http://www.ncbi.nlm.nih.gov/SNP), the Marshfield Database (http://research.marshfieldclinic.org/genetics/Default.htm) and the ongoing HapMap project.

Admixed populations and admixture proportions

Since the amount of ALD created is proportional to the level of admixture in a population, it is important briefly to review studies on admixture levels across populations. Those populations that are likely to be useful for admixture studies include African Americans, Mexican Americans, Cubans and Puerto Ricans in the USA, African Caribbeans, various Latin American populations, various groups in Central and South America and the Caribbean islands, Anglo Indians in India and 'coloured' populations of South Africa. Various statistical approaches have been used to estimate admixture proportions in these populations and have been reviewed in detail elsewhere [37]. These include a least squares method, a weighted least squares method [16, 38, 39] and likelihood methods [38, 40]. A recent review of admixture studies and admixture proportions of various Latin American populations is provided by Sans [41]. African Americans are a well-studied group with substantial European and West African contributions and a smaller Native American contribution [27, 35, 42, 43]. A survey of current literature indicates that European admixture ranges from 3.5 per cent in the Gullah Sea Islanders of South Carolina [35], to 28 per cent in New Orleans [35]. Admixture estimates in African-American populations can be highly variable across the USA, which is likely to reflect local variation in the demographic histories and social norms.
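The idea behind the least squares estimators can be shown for the simplest case of two parental populations. The sketch below is a generic illustration, not the specific estimators of the cited studies. It assumes that the allele frequency in the admixed population at each marker is (1 - m)p1 + m·p2 and chooses the m that minimises the (optionally weighted) sum of squared deviations across markers. The function name and the allele frequencies are hypothetical.

```python
def least_squares_admixture(p_admixed, p_parent1, p_parent2, weights=None):
    """Estimate the proportion m contributed by parental population 2.

    Minimises sum_i w_i * (p_admixed_i - p1_i - m * (p2_i - p1_i))**2,
    which gives m = sum w*(ph - p1)*(p2 - p1) / sum w*(p2 - p1)**2.
    """
    if weights is None:
        weights = [1.0] * len(p_admixed)
    numerator = 0.0
    denominator = 0.0
    for w, ph, p1, p2 in zip(weights, p_admixed, p_parent1, p_parent2):
        numerator += w * (ph - p1) * (p2 - p1)
        denominator += w * (p2 - p1) ** 2
    if denominator == 0.0:
        raise ValueError("markers carry no information: parental frequencies are equal")
    return numerator / denominator


# Hypothetical parental allele frequencies at three markers.
parent1 = [0.9, 0.1, 0.8]
parent2 = [0.2, 0.7, 0.1]

# Construct an admixed population with exactly 20 per cent ancestry from
# population 2, so the estimator should recover m = 0.2.
true_m = 0.2
admixed = [(1.0 - true_m) * a + true_m * b for a, b in zip(parent1, parent2)]

estimate = least_squares_admixture(admixed, parent1, parent2)
print(f"estimated admixture proportion: {estimate:.3f}")
```

Markers with a large frequency difference between the parental populations dominate the estimate, which is one reason why markers with high δ are preferred.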

US Hispanics form a complex socio-political conglomerate including Puerto Ricans, Cubans, Spanish Americans and Mexican Americans. Various groups from Central and South America can also be studied using AM. The proportional contributions from parental Europeans are estimated to be the largest, followed by a substantial Native American ancestry and varying amounts of West African ancestry [16, 17, 44]. In a sample of Mexican Americans from Arizona, the admixture estimates obtained using a weighted least squares method showed 29 ± 4 per cent Native American, 68 ± 5 per cent European and 3 ± 2 per cent West African contribution [16]. A recent study reports the following estimates for a Hispanic population from the San Luis Valley, Colorado: 62.7 ± 2.1 per cent European, 34.1 ± 1.9 per cent Native American, 3.2 ± 1.5 per cent West African [17]. In Puerto Ricans from New York City, the estimates obtained were 53.3 ± 2.8 per cent European, 29.1 ± 2.3 per cent West African, 17.6 ± 2.4 per cent Native American [17]. In a separate Mexican-American population sample from California, European ancestry was estimated to be 60 per cent and Native American contribution was estimated at 40 per cent [25]. As with African-American populations, there is substantial variation across populations. From these results, it is evident that, when studying any new admixed population sample, it is important to determine the proportional contributions accurately and not to rely on previously obtained estimates from a similar population. Additionally, it is instructive to have information on the levels of stratification related to admixture that are present in the population under consideration [15].

Ancestry-phenotype correlations; phenotype and complex disease gene mapping

Traits and diseases more prevalent in one population than in others are amenable to admixture analysis and some examples are listed in Table 1. Most of the diseases shown in this Table have a complex aetiology affected by multiple genes and environmental factors. Earlier studies [45, 46] focused on admixed populations as units of analysis in exploring relationships between ancestry and phenotypes [12]. These authors showed that non-insulin-dependent (Type 2) diabetes mellitus prevalence is correlated with admixture proportions among a selection of populations with varying levels of Native American ancestry. Data like these provide compelling evidence for frequency differences in risk modifying alleles, but such data have not been collected for many diseases. Another related approach is to test for individual admixture-phenotype correlations within an admixed population. Correlations between ancestry and phenotypes have been detected and reported by various authors [14, 17-19, 44, 45, 47].

A prerequisite for testing ancestry/phenotype correlations is the presence of stratification related to admixture, which will be evident in variation in individual ancestry levels. Figure 2 shows the distribution of BGA estimates from three examples of Hispanic population samples: Puerto Ricans from New York, Mexicans from Tlapa, Mexico and Hispanics from the San Luis Valley, Colorado [17]. Substantial variation is observed in all three samples. With the San Luis Valley group, more variability is observed on the European-Native American axis, while the New York group is more variable on the European-West African axis. Following the argument of Chakraborty and Weiss [48], admixture proportions should be correlated with diseases/traits that differ in populations due to underlying genetic differences. In each of these population samples, strong positive correlation was observed between individual ancestry and skin pigmentation measured as melanin index 'M' or lightness index 'L' (Figures 3A, 3B and 3C). A significant negative correlation was also observed between the proportion of West African ancestry and bone mineral density (BMD) in the Puerto Rican sample [17]. Proportion of West African ancestry and skin pigmentation (measured as melanin index) in individuals are also correlated in African Americans from Washington DC and African Caribbeans from the UK, but not in European Americans from State College, Pennsylvania (Figure 4) [14]. Recently, correlations have been observed between proportion of West African ancestry and lower insulin sensitivity, higher fasting insulin and acute insulin response to glucose in a combined sample of African-American and European-American children [20]. In a separate sample of African-American females, West African admixture is associated with body mass index, fat mass, fat-free mass and BMD [19]. It is important to keep in mind that ancestry-phenotype correlations are dependent on both the existence of functional alleles at different frequencies in parental populations, and significant stratification related to admixture. Although most admixed populations tested to date are structured, there is variation in the amount of stratification present, and this structure should be tested for explicitly when investigating a new population [15, 42, 49].
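A test of ancestry-phenotype correlation of the kind described above can be sketched as a Pearson correlation between individual ancestry estimates and a quantitative trait. The values below are synthetic numbers invented purely to make the example run; they are not data from any of the studies cited, and the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic individual ancestry proportions and trait values (illustration only).
ancestry = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80]
trait = [31.0, 30.5, 34.0, 38.5, 41.0, 47.5, 52.0]

r = pearson_r(ancestry, trait)
print(f"r = {r:.3f}")
```

A correlation of this kind can only be detected when individual ancestry varies within the sample; if every individual had the same ancestry proportion, the variance term for ancestry would be zero and the coefficient undefined.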

Figure 2

Triangle plot showing biogeographical ancestry of three Hispanic populations. Each vertex represents a parental population, which for this plot are Europeans, West Africans and Native Americans. The three populations shown are Hispanics from the San Luis Valley (blank circles), Puerto Ricans from New York City (grey diamonds) and Mexicans from Tlapa, Mexico (grey triangles) (adapted from Bonilla, 2003 [17])

Figure 3

The relationship between proportional ancestry and skin pigmentation in three Hispanic populations. For all populations, proportional ancestry was estimated using the maximum likelihood (ML) method (adapted from Bonilla, 2003) [17]. (A) Percent Native American ancestry versus lightness index (L) in Hispanics from the San Luis Valley, Colorado (ancestry estimated using 22 AIMs). (B) Percent Native American ancestry versus melanin index in Mexicans from Tlapa, Mexico (ancestry estimates using 29 AIMs). (C) Percent African ancestry versus melanin index (M) in Puerto Ricans from New York City (ancestry estimated using 35 AIMs)

Figure 4

The relationship between percent African ancestry and skin pigmentation in three populations. Percent African Ancestry (obtained using 34 AIMs and calculated by the maximum likelihood (ML) method) and the melanin index (M) are shown for three populations, European Americans from State College, Pennsylvania (diamonds), African Americans from Washington, DC and State College, Pennsylvania (squares) and African Caribbeans from Britain (triangles). (With permission from Shriver et al., 2003 [14])

Methods developed for admixture analyses/study design

Theoretical and experimental studies have explored the parameters that characterise and affect admixture studies [15, 24, 28, 35, 42, 50, 51]. The acronym MALD was proposed [28, 50] to designate the mapping method proposed originally by Chakraborty and Weiss, which exploited the long range allelic associations created through ALD [12]. Parameters critical for MALD include the genetic distance between markers and disease locus (θ); number of generations since admixture (t); proportion of admixture (m) from one parental population; the allele frequency differential (δ) between parental populations; and sample size (N) [12, 28, 52]. Simulation studies suggest that sample sizes of 200-300 patients, typed for 200-300 evenly spaced markers, each having allele frequency differentials >0.3, have a >95 per cent chance of locating the causative gene, when there has been no new admixture from the parental population in the last four generations and no other sources of population structure or sample heterogeneity [28, 50].

Other approaches proposed for using admixture include a method based on the transmission disequilibrium test (TDT) [53] that assesses excess transmission of alleles derived from high-risk ancestors to affected offspring of parents who are heterozygous at the marker locus, containing one allele from each of two ancestral populations [52]. A second TDT-based likelihood approach was developed that compared the transmission of haplotypes with non-transmission in affected offspring in an admixed population following a multipoint method. It obtained a likelihood statistic to determine the significance of various models under different scenarios [54].
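The core of a TDT-style test is a comparison of how often heterozygous parents transmit, or fail to transmit, a given allele to affected offspring. The sketch below uses the standard TDT statistic, (b - c)²/(b + c), referred to a chi-square distribution with one degree of freedom. In the admixture setting, b and c would count transmissions and non-transmissions of the allele derived from the high-risk ancestral population. The counts are hypothetical, and the sketch does not reproduce the multipoint likelihood method described above.

```python
import math


def tdt_statistic(transmitted, not_transmitted):
    """TDT chi-square statistic from counts of heterozygous parents who
    transmitted (b) or did not transmit (c) the allele of interest."""
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


def chi_square_1df_p_value(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: 60 transmissions and 40 non-transmissions.
stat = tdt_statistic(60, 40)
p_value = chi_square_1df_p_value(stat)
print(f"TDT chi-square = {stat:.2f}, p = {p_value:.4f}")
```

Because only transmissions from heterozygous parents are counted, the comparison is made within families, which is why the TDT is robust to population stratification.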

One fundamental limitation of MALD, as initially described and in its early extensions, is the effect of stratification in causing false-positive associations [12, 24, 28]. The TDT is one means of correcting for this stratification. Another is by conditioning on parental admixture [29]. Marker data at all loci are combined to estimate ancestry of alleles at each locus. When allelic ancestry at marker loci is known, this approach is analogous to a linkage analysis, hence the term AM is more appropriate than MALD for describing this method and to distinguish it from LD approaches [13, 14, 29]. The underlying variation in ancestry of chromosomes of mixed descent is modelled to extract all of the information about linkage that is generated by admixture. For example, where a locus is assumed to account for variation in skin pigmentation between two parental groups, eg West Africans and Europeans, individuals can be classified according to whether they have 0, 1 or 2 alleles of West African descent at this locus. By comparing these three groups for mean pigmentation level, holding all other factors constant, variation in pigmentation can be observed depending upon the number of alleles of West African ancestry in an individual. Controlling for parental admixture eliminates association of the trait with ancestry at unlinked loci. By removing the background effects of ancestry, it is possible to observe the locus-specific effects on a trait/disease [14, 17]. Allelic ancestry at a locus is inferred from the marker by using the conditional probability of each allelic state given the ancestry-specific allele frequencies. A complex hierarchical model with many nuisance parameters is used to model the distribution of admixture in the population. This is implemented using the ADMIXMAP program (at http://www.lshtm.ac.uk/eph.eu/GeneticEpidemiologyGroup/htm), which follows a Bayesian approach with Markov chain simulation, and incorporates the admixture of each individual's parents and the random variation of ancestry on chromosomes inherited from each of the parents in the model [13, 14, 29].
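The step of inferring allelic ancestry from ancestry-specific allele frequencies can be illustrated for a single marker with Bayes' rule. The sketch below is a deliberate simplification and is not the ADMIXMAP algorithm: it ignores information from neighbouring markers, treats the admixture proportions as known rather than modelling them hierarchically, and uses hypothetical frequencies and names.

```python
def ancestry_posterior(admixture, allele_freqs):
    """Posterior probability that an observed allele derives from each
    parental population.

    admixture    -- prior ancestry proportions, e.g. {"West African": 0.8, ...}
    allele_freqs -- frequency of the observed allele in each parental population
    """
    joint = {pop: admixture[pop] * allele_freqs[pop] for pop in admixture}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("observed allele has zero frequency in all parental populations")
    return {pop: value / total for pop, value in joint.items()}


# Hypothetical prior ancestry proportions for the chromosome in question.
prior = {"West African": 0.8, "European": 0.2}
# Hypothetical frequencies of the observed allele in the parental populations.
freqs = {"West African": 0.9, "European": 0.1}

posterior = ancestry_posterior(prior, freqs)
for population, probability in posterior.items():
    print(f"P({population} | allele) = {probability:.3f}")
```

With a marker of high δ the posterior is close to 0 or 1, which is what allows individuals to be grouped by the number of alleles of a given ancestry that they carry at a locus.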

Variation in individual admixture introduces population stratification, which in turn can inflate the number of significant associations that are observed [53, 55, 56] and is a potential confounder in association studies [29, 57-59]. Various statistical approaches have been developed to detect and control for stratification within a population sample [14, 15, 17, 42, 60-62]. For example, the Dt/D0 test examines the relationship between the observed LD and the predicted ALD between unlinked marker pairs for detecting structure within the sample. Using individual ancestry as a conditioning variable in analysis of variance tests, it is possible to eliminate association of the trait with unlinked alleles [14, 17]. The Bayesian approaches implemented by McKeigue et al. and Pritchard et al. [13, 61] offer an advantage over classical maximum likelihood based methods [44, 63] by allowing for missing genotype and ancestry data and modelling admixture hierarchically. Methods have been developed to control for parental admixture [29] and to account for uncertain BGA estimation [59].

Recent studies and future directions

Several theoretical and practical studies indicate that AM approaches promise to be suitable for identifying genes causing complex diseases. Methodological advancements have been made to offset the potential problems arising from association between unlinked loci by conditioning on parental admixture [13, 29], and to detect and correct for population stratification [59, 60]. Use of Bayesian AM [13, 29, 59] can take into consideration various uncertainties, including missing data values for estimating admixture proportions, and can overcome problems arising out of mis-specification of parental allele frequencies and promises to be an effective tool for admixture studies. This method, which is different from the classical disequilibrium-based approach that is more commonly used, is perhaps more suitable for disease gene mapping in admixed populations and has already been successfully used for mapping [14]. Table 2 summarises recent studies showing associations between ancestry and phenotypes/diseases and instances where AM was used to identify genes. Currently, the primary impediment to exhaustive AM genome scans is the lack of verified AIM panels. Sufficient numbers of markers are available as candidate AIMs, but effort and resources are required to confirm these markers and to generate accurate parental allele frequencies. Efforts are currently underway in several laboratories to identify more AIMs for this purpose. It seems inevitable that more such studies will be carried out in the near future to utilise the immense potential of this approach.

Table 2 Diseases showing ancestry-phenotype correlation