Multilocus OCA2 genotypes specify human iris colors
Human iris color is a quantitative, multifactorial phenotype that exhibits quasi-Mendelian inheritance. Recent studies have shown that OCA2 polymorphism underlies most of the natural variability in human iris pigmentation but to date, only a few associated polymorphisms in this gene have been described. Herein, we describe an iris color score (C) for quantifying iris melanin content in-silico and undertake a more detailed survey of the OCA2 locus (n = 271 SNPs). In 1,317 subjects, we confirmed six previously described associations and identified another 27 strongly associated with C that were not explained by continental population stratification (OR 1.5–17.9, P = 0.03 to <0.001). Haplotype analysis with respect to these 33 SNPs revealed six haplotype blocks and 11 hap-tags within these blocks. To identify genetic features for best-predicting iris color, we selected sets of SNPs by parsing P values among possible combinations and identified four discontinuous and non-overlapping sets across the LD blocks (p-Selected SNP sets). In a second, partially overlapping sample of 1,072, samples with matching diplotypes comprised of these p-Selected OCA2 SNPs exhibited a rate of C concordance of 96.3% (n = 82), which was significantly greater than that obtained from randomly selected samples (62.6%, n = 246, P<0.0001). In contrast, the rate of C concordance using diplotypes comprised of the 11 identified hap-tags was only 83.7%, and that obtained using diplotypes comprised of all 33 SNPs organized as contiguous sets along the locus (defined by the LD block structure) was only 93.3%. These results confirm that OCA2 is the major human iris color gene and suggest that using an empirical database-driven system, genotypes from a modest number of SNPs within this gene can be used to accurately predict iris melanin content from DNA.
Variable iris pigmentation in humans is the result of variable distribution of pigment granules among a fixed number of stromal melanosomes (Imesch et al. 1997). The pigment, melanin, is a light-absorbing biopolymer synthesized and stored in the melanosomes of melanocytes. Different iris colors and patterns are probably a function of melanin synthesis levels and developmentally programmed melanocyte distribution patterns. Studies of oculocutaneous mutants (both man and model systems) show that complete demelanization is associated with a pink or red iris color due to the passage of white light through the cornea and depigmented iris, and reflection from internal capillaries on the retina. In normal subjects, specific wavelengths of the white light are scattered to the surface as a function of the composition of the melanosomes as well as their packaging or distribution (Prota et al. 1998; Sturm and Frudakis 2004). Uniform distributions of melanin-rich melanosomes give rise to darker iris colors due to high absorption and little reflection of light to the surface. Regions of thinly distributed melanosomes with less melanin give rise to lighter colors due to less absorption and more reflectance of blue and green wavelengths to the surface. Melanin in the iridial stroma is synthesized from tyrosine and takes two forms: eumelanin (brown pigment) and, when cysteine is present, pheomelanin (red-yellow pigment). Various iris colors are associated with various ratios of the two; green irides are associated with pheomelanin, brown irides with eumelanin, mixed color eyes with a mix of the two and blue irides with very low levels of either. The distribution of colors is complex: some greens are brighter than others, some browns show a red tint while others do not, and some irides of blue color appear as dark irides from a distance due to a darker hue. The patterning of iris colors from individual to individual is equally complex.
Blue irides often have peripupillary rings or sectors of brown, red or yellow and many irides that appear to be of a light color from a distance are revealed upon closer inspection as a mix of light and dark regions.
The genetics of iris color has long interested teachers and basic biologists, but has been largely ignored by research geneticists. Part of the reason for this is the difficulty of measuring the melanin content of the iris. Though skin and hair pigmentation phenotypes are notoriously complex (Akey et al. 2001; Bito et al. 1997; Sturm et al. 2001; Box et al. 1997, 2001), the inheritance of iris color is quasi-Mendelian. The iris color of offspring can usually be predicted given those of the parents, but there are numerous examples of parents with blue irides producing offspring with brown or partially brown irides (Sturm et al. 2001; Brauer and Chopra 1978). This inheritance pattern seems to suggest a small number of genes at work, with one of these probably explaining most of the variation. Indeed, early pedigree studies suggested that natural variation in iris color is a function of variation at two loci: a major locus responsible for pigmentation of the iris but not the skin or hair, and another pleiotropic gene controlling pigmentation levels in all tissues simultaneously (Brues 1975). Study of oculocutaneous albinism in both man and model organisms since has revealed that a small family of pigmentation genes is required for normal pigmentation of the iris, most known or thought to be directly or indirectly involved in the biosynthesis of eumelanin from tyrosine (Sturm et al. 2001; Durham-Pierre et al. 1994, 1996; Gardner et al. 1992; Hamabe et al. 1991; Chintamaneni et al. 1991; Abbott et al. 1991; Boissy et al. 1996; Robbins et al. 1993; Smith et al. 1998; Flanagan et al. 2000; Ooi et al. 1997). However, until recently little work had been directed at identifying the determinants of natural (non-disease based) variation of iris colors.
Publication of the human genome map and the proliferation of genome screening tools enabled the first of these screens just recently, and a few studies have implicated one human pigmentation gene, OCA2, as the primary determinant of naturally occurring iris colors in humans (Sturm and Frudakis 2004). With a linkage scan, Eiberg and Mohr (1996) localized a brown-iris locus to an interval containing the OCA2 and MYO5A genes on chromosome 15. Zhu et al. (2004) used sib-pairs to identify a massive LOD peak (LOD = 19.2) for a chromosomal interval containing the OCA2 and MYO5A loci but no significant peaks on other chromosomes, and these authors concluded that a gene within this interval (most likely OCA2) explained about 75% of iris color variation. This result was later reproduced by others who obtained a LOD = 2.9 over this region (Posthuma et al. 2006). Frudakis et al. (2003) showed that of 16 pigmentation genes screened, by far the most numerous and consistently strong SNP allele associations with respect to iris colors and shades measured in different ways were found at OCA2 (in contrast to those from MYO5A), suggesting that the aforementioned linkage results were due to the influence of the OCA2 gene. As this paper was being written, other authors demonstrated that diplotypes from three OCA2 SNPs are sufficient to explain most of the variation in human iris color (Duffy et al. 2007).
OCA2 is the human homologue of the mouse pink-eyed (p) dilute mutation (Sturm et al. 2001; Rinchik et al. 1993; Brilliant 2001; Oetting et al. 1998) and the OCA2 protein product is distributed in the melanosomal membrane. Though the function of OCA2 has not yet been determined, it shares structural similarity with ion transporters (i.e. the E. coli Na+/H+ anti-porter), possibly indicating that it is involved in pH regulation (Puri et al. 2000). pH is thought to be important for the activity of tyrosinase (TYR), which is integral to both the eumelanin and pheomelanin biosynthesis pathways. It has been suggested that because p mice show perturbed eumelanin with normal pheomelanin production, OCA2 is involved only in eumelanin anabolism, but this conclusion may be simplistic. Because TYR catalyzes two separate biochemical reactions in the eumelanin pathway and only one in the pheomelanin pathway, it may be that certain OCA2 mutations affect the two related pathways disproportionately. The OCA2 gene is characterized by remarkable locus diversity: over 1,129 OCA2 SNPs are known to exist. The purpose of the work reported herein was to more fully characterize the association between OCA2 polymorphism and iris pigmentation. From a survey of 271 OCA2 SNPs, using a more objective phenotyping system, we confirmed the top six OCA2 associations we had previously reported (Frudakis et al. 2003) and identified an additional 27 associated SNPs spanning the OCA2 locus. We show that when alleles of these SNPs are assembled into multilocus haplotypes (diplotypes), the iris color scores among samples with matching genotypes are over 96% concordant. The level of concordance for new samples not used in the screening step was similar to that for samples used to identify the SNP associations. Our results suggest that OCA2 diplotypes can be used for predicting iris color from DNA with good accuracy.
Samples were collected from volunteers at scientific conferences and students as well as local residents using informed consent and a basic biographical questionnaire for the self-reporting of ethnicity. Our survey area extended throughout the continental US. Cryptic population structure differences among subjects of different phenotype can contribute towards type I error, which has been suspected as a problem with SNP-based association studies in the past (Hoggart et al. 2003; Terwilliger and Goring 2000). We attempted to minimize the contribution of cryptic population stratification by using samples from individuals that self-reported as “Caucasian”, or similar by screening against the use of terms “African”, “African American”, “Asian” or “Indigenous American”. Then, as have others before us working on a variety of diseases (Fernandez et al. 2003; Molokhia et al. 2003; McKeigue 1997; Zhu et al. 2005; Reiner et al. 2005) and other phenotypes (Parra et al. 2004; Shriver et al. 2003; Smith et al. 2004; Yang et al. 2005; Bonilla et al. 2004), we estimated individual genomic ancestry admixture for each sample. These estimates serve two purposes—eliminating samples from the analysis with high non-European admixture eliminates elements of potentially confounding structure (particularly since iris colors are unequally distributed among world populations), and after identifying associated OCA2 SNPs, we can use the estimates to distinguish those that are in linkage disequilibrium (LD) with phenotypically active loci from those that are merely markers of elements of residual population structure (i.e. lower levels of admixture) correlated with phenotype (Rosenberg et al. 2005; Hoggart et al. 2003; Choudhry et al. 2006; Parra et al. 1998; McKeigue et al. 2000; Pfaff et al. 2001). 
Maximum likelihood estimates of individual biogeographical ancestry (BGA) admixture were determined using methods described previously by others (Bernstein 1931; Chakraborty 1986) and 176 ancestry informative markers (AIMs) (Halder et al. 2006; Frudakis et al. 2003; Frudakis 2005; Shriver and Kittles 2004) chosen from the genome based on their information for a continental, 4-population model. For describing elements of this model, we found the terms “European” (genetic ancestry shared among Europeans, Middle Eastern and to a lower extent South Asians), “sub-Saharan African”, “East Asian” and “Indigenous American” (genetic ancestry shared among American Indians, Latin and South American Indians and certain Central Asian populations) were convenient. The choice of a 4-population model was based on previously published hypothesis-free cluster analyses of world-wide populations (Rosenberg et al. 2002; Shriver et al. 2005), and the nomenclature of the parental populations is reflective of modern population distributions, not necessarily the distributions of the original parental populations that existed from 15,000–50,000 years ago.
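The idea behind a maximum likelihood admixture estimate can be illustrated with a deliberately simplified sketch. It assumes only two parental populations (the study used a 4-population model and 176 AIMs), unlinked biallelic markers, and a plain grid search; the function names, marker frequencies and genotypes below are hypothetical and are not taken from the study or from the cited methods.

```python
from math import comb, log

def log_likelihood(m, genotypes, freqs_a, freqs_b, eps=1e-6):
    """Log-likelihood of ancestry proportion m from population A.

    genotypes: copies (0, 1 or 2) of the reference allele at each marker.
    freqs_a, freqs_b: reference-allele frequencies in parental populations A and B.
    """
    total = 0.0
    for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
        # Expected allele frequency for an individual with admixture m.
        f = m * pa + (1 - m) * pb
        f = min(max(f, eps), 1 - eps)  # avoid log(0) at the boundaries
        total += log(comb(2, g) * f**g * (1 - f) ** (2 - g))
    return total

def mle_admixture(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search estimate of m (hypothetical helper, two-population case only)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda m: log_likelihood(m, genotypes, freqs_a, freqs_b))

# Hypothetical marker frequencies and one individual's genotypes.
freqs_a = [0.9, 0.8, 0.1]
freqs_b = [0.1, 0.2, 0.9]
estimate = mle_admixture([2, 2, 0], freqs_a, freqs_b)
```

With more parental populations the same likelihood is maximized over a vector of proportions that sums to one, which generally calls for a numerical optimizer rather than a one-dimensional grid.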
It is suspected by many that results from association scans have been difficult to replicate due in part to imprecise definition of phenotypes, resulting in the measurement of numerous covariates simultaneously (Terwilliger and Goring, 2000). We therefore invested substantial effort into phenotyping our subjects in a manner best suited for testing our hypothesis that variable OCA2-mediated melanin production was the primary determinant of human iris colors. We used digital photography and in-silico spectrophotometry to quantify the eumelanin content of each iris irrespective of its iridial patterning in terms of an Iris Color Score (“C”, see Supplementary Materials). The C value combines the luminance, red, green and blue color reflectance values from four quadrants of each iris into a single variable. Darker irides with greater eumelanin content are characterized by lower C values and lighter irides with less eumelanin are characterized by higher C values.
Genotyping was carried out using single base primer extension and a Beckman SNPstream instrument (Beckman-Coulter, Fullerton, CA, USA). Amplification primers and multiplexes thereof were designed using Beckman’s Autoprimer software program (http://www.autoprimer.com). Reaction conditions followed those recommended for the SNPstream, except that Hot Master Taq was used in all PCR reactions.
Allele frequencies were derived from counts using the function p_i = x_i/(2n), where x_i is the number of times that allele or haplotype i was observed and n is the number of subjects in the group. Haplotypes were inferred using the PHASE program (Stephens et al. 2001; Stephens and Donnelly 2003). Allele and genotype associations were analyzed using a 2 × 2 contingency table with a two-tailed Fisher’s Exact Test (GraphPad Software Inc.) or using multiple regression with MedCalc for Windows (statistics for biomedical research software, version 9), as indicated. Regression of admixture on iris color scores was also carried out using MedCalc software. Multiple regression coefficients of determination were calculated from models incorporating the indicated sets of SNPs as independent variables, except for the analysis of all 33 SNPs simultaneously. Due to the large number of variables, the coefficient for all 33 SNPs was estimated by adding those for the contiguous SNP sets defined by the LD block structure, as described in the text. Power calculations were performed using the Genetic Power Calculator (Purcell et al. 2003) with observed values for allele frequencies, phenotype prevalence (0.5 for either light or dark irides), genotype relative risk (for the associated phenotype) and D’ (1.0, since the OCA2 gene is likely to be the main iris color pigmentation gene). Relative risk and odds ratios with their 95% confidence intervals were calculated using the Odds Ratio Generator for Windows (Devilly 2005). LD and Hardy–Weinberg equilibrium (HWE) were calculated using HAPLOVIEW software (Barrett et al. 2005) and confirmed using Arlequin software (Schneider et al. 2000). For LD calculations we used 3,000 permutations and 10 initial conditions, and for HWE calculations we used 100,000 steps in the Markov chain and 1,000 dememorisation steps.
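As a small illustration of the frequency estimate p_i = x_i/(2n), the following sketch counts allele copies across diploid genotypes. The function name and the two-character genotype encoding are assumptions made for the example, not part of the study's software.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Estimate p_i = x_i / (2n) from diploid genotypes.

    genotypes: list of two-character strings such as "AG", one per subject
    (an encoding assumed here for illustration).
    """
    n = len(genotypes)
    # x_i: number of copies of allele i observed across all 2n chromosomes.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    return {allele: x / (2 * n) for allele, x in counts.items()}

# Four hypothetical subjects: 4 copies of A and 4 copies of G among 8 chromosomes.
freqs = allele_frequencies(["AA", "AG", "GG", "AG"])
```

The same counting applies to phased haplotypes if each subject contributes two haplotype labels instead of two allele characters.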
LD and haplotype block analysis for the identification of hap-tag SNPs was accomplished using the program HAPLOVIEW (Barrett et al. 2005).
Iris color phenotyping
Iris color parameter statistics (n = 1,072)
OCA2 has already been implicated as the locus underlying most variation in human iris colors. Our aim here was to more comprehensively mine the OCA2 locus for iris color information, and identify a complement of markers that could be used to predict iris color from DNA. A preferred approach would rely on hap-tags from the HapMap project, but at the time this work commenced, hap-tags for the OCA2 locus were not yet available. Thus, we chose to apply a “shotgun approach” for identifying OCA2 SNPs associated with iris color, that is, picking SNPs randomly to cover the locus evenly with respect to map position. We then mapped the haplotype and phenotype structure of the locus with respect to the associated SNPs and identified the most parsimonious complement fully predictive for iris color.
We first examined 147 biallelic OCA2 SNPs randomly selected from the NCBI:dbSNP database, each with a validated population allele frequency. We focused on higher-frequency SNPs (minor allele frequency > 5%) rather than on coding SNPs, and the selected 147 candidates covered the length of the locus. We broke a sample of 1,317 subjects into two eye color score groups: those with color scores above the mean value of 2.07 (including blue, green, blues and greens with small brown rings/sectors/flecks, etc.) and those below (the darker irides; hazels, blues with thick brown rings, solid browns, etc.). We then used Fisher’s exact test to identify alleles with unequal distribution among these two groups, considering association either at the level of the allele or at the level of the genotype. The most strongly associated SNPs (marginally associated, n = 22) are presented in Table S1a (OCA2-A-1 through OCA2-C-6; Supplementary Material). The odds that a particular genotype would be found in an individual of a color type with which the genotype was associated (color type and genotype indicated in column 4, Table S1a) ranged from 1.5 to 18 and for most loci, the ratio was significantly above 1.0 (Assoc’d Genotype Odds Ratio, column 8, Table S1a). Regression analysis showed the associations with quantitative iris color scores were stronger than those obtained from contingency analysis based on discrete iris color bins (column 7, Table S1a). Since the SNPs are located within the same gene, which we already know is linked with human iris color, a correction for multiple tests was not appropriate. Six of the SNPs were previously described as associated with iris color (Frudakis et al. 2003, column 6, Table S1a) and their P values were of similar strength as previously reported (column 5, Table S1a). Due to the relatively high minor allele frequencies, our study was well powered to detect these strong associations; for most we had a power well over 90% (column 9, Table S1a).
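The contingency analysis described above can be sketched as follows: subjects are split at the mean color score, cross-tabulated against carrier status for a genotype, and tested with a two-sided Fisher's exact test alongside an odds ratio. This is an illustrative reimplementation using only the Python standard library, not the GraphPad or Odds Ratio Generator software named in the Methods; the function names, the Woolf (logit) confidence interval and the toy data are assumptions.

```python
from math import comb, exp, log, sqrt
from statistics import mean

def two_by_two(scores, carriers):
    """Cross-tabulate color group (above vs. at/below mean C) by carrier status."""
    cutoff = mean(scores)
    a = b = c = d = 0
    for score, carrier in zip(scores, carriers):
        if score > cutoff:          # lighter irides
            if carrier:
                a += 1
            else:
                b += 1
        else:                       # darker irides
            if carrier:
                c += 1
            else:
                d += 1
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P: sum of all tables no more probable than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = {x: comb(r1, x) * comb(r2, c1 - x) / denom for x in range(lo, hi + 1)}
    p_obs = probs[a]
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (logit) 95% CI; assumes no zero cells."""
    estimate = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, exp(log(estimate) - z * se), exp(log(estimate) + z * se)

# Hypothetical color scores and carrier flags for eight subjects.
scores = [1.0, 1.2, 1.5, 2.5, 2.8, 3.0, 3.1, 1.1]
carriers = [False, False, True, True, True, True, False, False]
table = two_by_two(scores, carriers)
p_value = fisher_exact_two_sided(*table)
or_est, or_low, or_high = odds_ratio_ci(*table)
```

With samples this small the interval is very wide; the sketch only shows how the table, P value and odds ratio relate to one another.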
Among the 1,317 samples, we identified 88 multilocus SNP genotypes present for more than one sample. For 48 of these, the samples possessed concordant iris colors (range of C across all samples with the genotype ≤1.0) and for the remaining 40, the samples exhibited iris color score discordance (range of C across all samples with the genotype >1.0). Given our hypothesis that common OCA2 SNPs explained most of the crude aspects of iris color variation (e.g. melanin content), this result suggested there remained other SNPs associated with iris color yet to be identified.
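The grouping step described above can be sketched as follows: samples sharing a multilocus genotype are collected, and each group with more than one sample is called concordant when the range of C is at most 1.0. The function name, genotype labels and scores are hypothetical and serve only to illustrate the criterion.

```python
from collections import defaultdict

def classify_genotype_groups(samples, max_range=1.0):
    """Split shared multilocus genotypes into concordant and discordant groups.

    samples: iterable of (multilocus_genotype, C) pairs.
    A group is concordant when max(C) - min(C) <= max_range.
    """
    groups = defaultdict(list)
    for genotype, score in samples:
        groups[genotype].append(score)

    concordant, discordant = [], []
    for genotype, scores in groups.items():
        if len(scores) < 2:
            continue  # genotypes seen only once cannot be assessed
        if max(scores) - min(scores) <= max_range:
            concordant.append(genotype)
        else:
            discordant.append(genotype)
    return concordant, discordant

# Hypothetical genotype labels and color scores.
samples = [
    ("AA|GG", 1.2), ("AA|GG", 1.9),   # range 0.7
    ("AG|GT", 1.0), ("AG|GT", 2.6),   # range 1.6
    ("GG|TT", 3.0),                   # singleton, ignored
]
concordant, discordant = classify_genotype_groups(samples)
```

The discordant groups returned by such a procedure are the ones a secondary screen would then try to resolve with additional SNPs.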
To identify these SNPs most efficiently, we searched a second set of 124 randomly selected OCA2 SNPs not previously tested in the first screen. We screened these 124 SNPs for conditional associations within the discordant genotype groups just discussed. That is, we were searching for SNPs that helped resolve the iris color score discrepancies we observed within the 40 multilocus genotype groups. An additional set of 11 associated SNPs were found (OCA2-D, Table S1b). Genotypes for each resolved discordant iris colors within at least one confounded multilocus genotype group and genotypes for several resolved discordant iris colors within multiple confounded multilocus genotype groups (Column 8 “Confounder Groups Resolved”, Table S1b). Alleles for six of these 11 were marginally (independently) associated with iris color scores as quantitative variables as well as binned iris color scores (above and below C = 2.2), but alleles for five were only associated with the former (Table S1b). This hierarchical screening approach allowed us to first discover numerous SNPs associated with iris color scores from our first screen, identify the gaps in iris color information content for this set (that would be required for accurate inference of iris color score), and then specifically fill those gaps from SNPs selected in a secondary screen.
Multiple regression analysis shows that most of the 33 associated OCA2 SNPs provide iris color information over and above that from previously published associations (Frudakis et al. 2003) or the hap-tag subset discussed in the text
(A) Frudakis et al. (2003) set + each SNP
(B) Hap-tags + each SNP
In our sample, both homozygotes and heterozygotes were observed for each of the 33 SNPs. Though departures from HWE are well-tolerated by PHASE (Stephens et al. 2001), we tested each SNP for HWE and found that alleles for 10 of the SNPs were not in HWE in our predominantly European sample (Supplementary Material, Table S2, those with P < 0.05, column 4). Each of these 10 cases was a function of an overrepresentation of homozygotes, rather than heterozygotes. For three of these, the genotype associated with lighter (higher) color scores was overrepresented (OCA2-B-3, OCA2-C-3, OCA2-D-2) but for the remaining seven, the “darker” genotype was overrepresented. Genotyping error is a potentially trivial source of such an observation, but genotypes for each of these 10 SNPs routinely passed Beckman UHT quality control checks, and the number of ambiguous genotype calls was no greater for this set of SNPs than for those in HWE. With the Beckman UHT, the reliability of a genotype call is inversely proportional to its distance from its assigned genotype cluster (in terms of a two-dimensional plot of allele X and Y signal), which is measured with a “LOD” score. The compactness of the genotype clusters is measured by averaging the “LOD” variables for calls within and across the clusters. When genotyping problems are encountered, lower LOD values obtain, indicating dispersed clusters that merge into one another. The tightness of the genotype clustering for those SNPs not in HWE (average LOD = 5.29) was not significantly different from those that were in HWE (average LOD = 5.16).
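To make the notion of a homozygote excess concrete, the sketch below applies a chi-square goodness-of-fit test for HWE and reports the fixation index F = 1 - H_obs/H_exp, which is positive when heterozygotes are underrepresented. This is a simpler stand-in for the Markov chain exact tests run in HAPLOVIEW and Arlequin; the function name and genotype counts are hypothetical.

```python
from math import erfc, sqrt

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 df) for a biallelic SNP.

    Returns (chi2, p_value, f), where f > 0 indicates a homozygote excess.
    Assumes both alleles are present (a monomorphic SNP would divide by zero).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    f = 1 - n_ab / expected[1]
    return chi2, p_value, f

# Hypothetical counts with too few heterozygotes (expected 25 / 50 / 25).
chi2, p_value, f = hwe_chi_square(40, 20, 40)
```

For SNPs with rare genotype classes an exact test is preferable to the chi-square approximation used here.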
inferred haplotypes for all 1,072 samples,
assembled the samples into OCA2 diplotype groups,
calculated the iris color score concordance among samples of shared diplotypes.
Iris color concordance within multilocus OCA2 genotype groups using two different evaluation criteria
[Table body not recovered. Columns: matched samples (N); % within predicted range for R = ±1.0 (I = 23) and for R = ±0.9 (I = 20). Recovered row labels: Random set 1, Random set 2, Random set 2 (as extracted), Total random set.]
A second concordance assessment was made using all 33 SNPs ordered into contiguous blocks demarcated by the boundaries of LD structure across the locus. We broke the 33 SNPs into three sets (which we will call “contiguous” blocks): the first covered haplotype blocks 1, 2 and 3 from the analysis of Fig. 2 (SNP 1: OCA2-B-1 contiguously through SNP 12: OCA2-B-3, multiple regression R2 = 0.0865 with iris color scores), the second covered haplotype blocks 4, 5 and 6 from the analysis of Fig. 2 (SNP 13: OCA2-D-5 contiguously through SNP 25: OCA2-C-5, multiple regression R2 = 0.0848 with iris color scores), and the third set covered the remaining SNPs from SNP 26: OCA2-B-5 through SNP 33: OCA2-D-11. The diplotypes for SNPs organized this way revealed a multiple regression R2 = 0.2342 with iris color scores. Using an R = 1.0 about the point estimate of C for each iris, the iris color scores for samples with the same OCA2 diplotypes across all three of the contiguous blocks were 93.3% concordant. Values for model and validation samples were similar with respect to the sample sizes available with matches (91.3 and 100.0%, respectively) (Table 3, “contiguous” rows). Similar but lower rates of iris color score concordance were observed using a range R = 0.9 of iris color scores (Table 3).
The final assessment involved partitions of the 33 associated SNPs into blocks that were empirically determined to be of the highest predictive value for iris color. Among the 22 SNPs identified from the first screen, we assembled each possible (2, 3, ..., 11) SNP combination, inferred diplotypes with respect to each, and determined the Fisher’s P value for association with binned iris color scores. Repeating this process for each possible set of combinations, we determined that haplotypes for the sets defined by the groupings in Table S1a (OCA2-A: 11 SNPs, OCA2-B: 6 SNPs, OCA2-C: 6 SNPs) provided the lowest combined Fisher’s P values. To these three sets of SNPs we added a fourth set comprised of those in Table S1b (OCA2-D: 11 SNPs), which were originally selected because they provided information in addition to those of the optimally associated OCA2-A, B and C groupings. We will term these SNP groupings as “p-Selected”. Using an R = 1.0 about the point estimate of C for each iris, we observed the iris color for 96.3% of the samples with matching p-Selected diplotypes fell within the predicted range (P < 0.0001), with a similar rate for Model (96.4%, P < 0.0001) and Validation samples (96.2%, P = 0.0018, shaded region, Table 3). In contrast, the average frequency with which the iris color of randomly selected samples fell within these same ranges was only 62.6% (Table 3) and this difference was highly significant (P < 0.0001). Using an R = 0.9 about each point estimate, we observed the iris color for 92.7% of the samples with matching p-Selected diplotypes fell within the predicted range and the rate with which the iris color of Model samples fell within the predicted range (92.9%) was similar to that with which the iris color of Validation samples fell within the predicted range (92.3%) (Table 3). 
In contrast, the frequency with which the iris color of randomly selected samples fell within this particular range was only 56.1% (Table 3) and this difference was highly significant (P < 0.0001). Multiple regression models using iris color score as the dependent variable revealed coefficients of determination (CODs) suggesting that the p-Selected SNP sets combine to explain a significant amount of variation in human iris color scores (OCA2-A COD = 0.1874, OCA2-B COD = 0.1247, OCA2-C COD = 0.0660, OCA2-D COD = 0.2544).
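The concordance assessment can be illustrated with a minimal database-matching sketch. It assumes that the predicted score for a query sample is the mean C of database samples sharing its diplotype, and that a query is concordant when its observed C lies within ±R of that prediction; the exact scoring rule is given in the Supplementary Materials and may differ. The function name, diplotype labels and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def concordance_rate(database, queries, r=1.0):
    """Fraction of matched queries whose C lies within +/- r of the predicted C.

    database, queries: iterables of (diplotype, C) pairs.
    Assumption: predicted C is the mean score of database samples with the same diplotype.
    Returns (number of matched queries, concordance rate or None if no matches).
    """
    by_diplotype = defaultdict(list)
    for diplotype, score in database:
        by_diplotype[diplotype].append(score)

    matched = hits = 0
    for diplotype, score in queries:
        reference = by_diplotype.get(diplotype)
        if not reference:
            continue  # no database entry with this diplotype: no prediction made
        matched += 1
        if abs(score - mean(reference)) <= r:
            hits += 1
    return matched, (hits / matched if matched else None)

# Hypothetical database and query samples.
database = [("d1", 1.0), ("d1", 1.4), ("d2", 3.0), ("d3", 2.0)]
queries = [("d1", 1.5), ("d2", 1.5), ("d9", 2.0)]
result = concordance_rate(database, queries, r=1.0)
```

A baseline comparable to the random sets could be obtained by pairing each query with database samples drawn without regard to diplotype and applying the same ±R criterion.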
Gallery display of iris color inferences
Associations independent of population structure
Previous work has implicated OCA2 as the primary locus underlying variable iris pigmentation. Though a number of previous OCA2 SNP associations have been identified (Frudakis et al. 2003; Duffy et al. 2007), and though most of the variation in iris colors is accounted for by OCA2 polymorphism (Zhu et al. 2004; Duffy et al. 2007), our understanding of the genetic basis of iris color has remained incomplete as evidenced by our inability to accurately predict even crude aspects of iris color from OCA2 SNP genotypes. This inability is a function of a few basic problems: (1) few of the polymorphisms are completely dominant or seem to exert their influence in a context-independent manner, (2) iris color is difficult to phenotype objectively, (3) the OCA2 locus has not been comprehensively screened for iris color information and the most important SNPs may not yet be identified, and (4) the genetic rules for how OCA2 gene variants influence iris color are not yet understood. In addition, until now it had not been proven that OCA2 alleles are truly associated with iris color as opposed to correlated with iris colors through their ancestry information. Herein, we attempted to address each of these problems simultaneously. We developed a more objective phenotyping system for iris color based on iris color scores, an in-silico method for simplifying an extraordinarily complex phenotype and specifically estimating the melanin content of the iris (irrespective of how it was patterned). Using these iris color scores, we performed the most comprehensive OCA2-iris color SNP screen so far described (271 SNPs), confirming some previously described OCA2 associations (Frudakis et al. 2003; Duffy et al. 2007) and identifying several new ones.
We qualified our analysis with respect to cryptic continental level population structure and lastly, we developed a diplotype-based classification system for the inference of iris color that does not require a prior understanding of the phenotype inheritance rules. From this multi-faceted approach we made a number of interesting observations. P values for the SNP associations were generally stronger with iris color as a quantitative variable compared to those obtained through contingency analysis with binned iris colors (based on these same quantitative scores). Indeed, some of the SNPs we identified (e.g. OCA2-D-8, OCA2-D-9) were only associated with iris color as a quantitative variable and these results suggest that the parameterization of iris color enhanced the power of our study. The associations we identified were shown to be independent from those previously described, and independent of the correlation between continental BioGeographical Ancestry admixture and iris color. Iris colors are unevenly distributed among continental populations, and given the correlation between admixture and iris color we observed, it was incumbent on us to prove that our associations were independent of such admixture. Having done so, our analysis represents the first formal demonstration that the OCA2 locus is associated with human iris colors per se. The accuracy of our classification system was shown to depend on the marker set used, with a hap-tag subset performing the worst, contiguously arranged SNPs next best and empirically identified SNP sets performing best. This order of performance mirrored the ranking in coefficient of determination (R2) among these sets, suggesting that our classification results were specifically a function of iris color information content.
Overall, our results confirm and extend observations from those before us that OCA2 is a primary iris color gene and suggest that certain OCA2 polymorphisms are necessary and sufficient as predictive markers for the crude aspects of iris color (namely, overall melanin content). We were able to pinpoint the iris color of test samples to within a range of values covering about half of the observed population range. This is essentially equivalent to being able to infer the shade, or lightness of the iris from DNA (but not necessarily the specific pattern or precise color). This is only a first step, and further refinements will clearly require the identification of additional OCA2 polymorphisms, polymorphisms in other genes and/or environmental factors that influence the expression and patterning of specific iris colors.
It has not yet been demonstrated that the polymorphisms we have described are functionally relevant, that is, that they represent phenotypically active variants. Indeed the same can be said of the OCA2 gene association itself. However, our results here combine with a preponderance of evidence from prior work to suggest that OCA2 does indeed represent the major human iris color gene. First, as shown herein, the associations are bona-fide associations with the iris color phenotype, not correlated population structure, and sufficient to predict iris color shade with excellent accuracy. Second, we know that mutations in OCA2 cause human albinism, including some forms that influence iris, but not hair or skin pigmentation (Boissy et al. 1996; Brilliant et al. 2001; Oetting et al. 1998; Rinchik et al. 1993). This latter observation is relevant because natural iris colors are inherited independently from hair and skin pigmentation phenotypes. Third, previous linkage and association studies have implicated OCA2 and the region of chromosome 15 containing OCA2 as explaining most of the variation in naturally occurring iris colors (Eiberg and Mohr 1996; Zhu et al. 2004; Posthuma et al. 2006). Fourth, the OCA2 locus is by far the most polymorphic of the known human pigmentation genes, which seems to fit with the quasi-Mendelian inheritance of iris color. Alternatively, the implication of a relatively simple gene (short, fewer polymorphisms) as the primary determinant for iris color would not have been as consistent with the quasi-Mendelian nature of iris color inheritance.
Database matching for classification
It is relatively uncommon for human SNP associations to be so fundamentally and profoundly tied to an element of phenotype expression that the phenotype is adequately predicted from SNP genotypes alone. This may be due in part to the fact that most association studies are not aimed at enabling phenotype prediction, and so rarely proceed to such an in-depth dissection of the phenotype–genotype relationship as we have executed here. The methods we used to predict the gross aspect of iris color from OCA2 diplotypes may be of interest to geneticists focused on a variety of problems and systems because the relationship between specific genotypes and complex and/or quantitative phenotypes has implications for better understanding not only the genetic architecture of the phenotype, but its evolutionary history. Prior to the human genome era, the prediction of complex phenotypes from genetic measurements had not been possible. The introduction of DNA microarrays broke this barrier and enabled the detection of highly characteristic and predictive RNA expression signatures for various phenotypes. In ovarian cancer alone, predictive gene expression signatures have been identified as features of tumor metastasis (e.g. Ramaswamy et al. 2003), malignancy (e.g. Ouellet et al. 2005), tumor drug resistance (e.g. Helleman et al. 2006) and cancer prognosis (Bild et al. 2006; Meinhold-Heerlein et al. 2005). However, genotype (i.e. DNA) “signatures” of multifactorial human traits have been far more elusive—probably because DNA is less directly related to phenotype than RNA. With our problem, we knew that the inheritance of iris color is relatively complex, and though we did not understand a priori the context dependence of the associations we had identified, we could clearly discern that predicting iris color from single or even small sets of highly associated OCA2 SNPs was not possible. 
Unfortunately, the number of variables produced from phasing all 33 SNP genotypes (1,018 multilocus genotypes) at our disposal was about the same as the number of samples (1,072), and on the surface, it would seem that any attempts at classification using such a large set would be challenging. However, this type of challenge has not prevented the successful classification of disease subtypes, drug response and prognosis using RNA expression signatures. We reasoned that if the OCA2 diplotypes we had identified unambiguously specified the most grossly observable aspect of iris color (eumelanin content), either directly through phenotypic activity or indirectly through linkage with the phenotypically active OCA2 variants, then the empirical database-matching method we described should perform well regardless of the size of the database. That is, the only penalty for a small database is that fewer “predictions” or “tests” of the hypothesis could be executed. The specificity of our results with iris color suggests that our reasoning was correct, and indicates that for some phenotypes, the use of databased diplotypes can constitute a stand-alone system for the inference of complex (multifactorial, quantitative) phenotypes. The database-matching method we have described is expected to be robust to various (and still unknown) components of iris color genetic complexity, such as those one might expect to encounter within large and/or highly polymorphic genes. For example, it is possible that haplotypes with a region marginally associated with “blue” irides and a region marginally associated with “brown” irides together beget “green” irides in an additive sense, or beget “brown” irides in an epistatic sense.
Neither the epistatic nor the dominance effects of a locus need be universal in all haplotype contexts, and predicting the myriad of potentially complex classification “rules” using summary statistics or hap-tag SNPs seems challenging for such a complex gene, if not impossible. Yet each is theoretically accommodated by a database-matching system even if the rules are not yet defined, and no matter how small the database (or how large the number of variables). Of course, the method is the most basic imaginable, and while it enjoys an empirical power not usually associated with methods that rely on generalizations based on population averages, it suffers from the requirement of very large sample sizes if one hopes to be able to classify most new samples encountered (particularly for large genes and/or systems involving many genes). For example, as a result of the current size of our database (n = 1,072), we could only “classify” about 8% of the iris colors in our sample (82 with p-Selected OCA2 diplotype matches out of 1,072 total samples). If we assume that our results extend to other samples, as our validation sample results suggest they do, then improving on this rate is simply a matter of building the database. In addition to increasing the number of shared OCA2 diplotypes, adding samples is expected to increase the number of observed haplotypes as well, but we expect a finite number of haplotypes and diplotypes to exist in nature, and at some point the complexity of the database is expected to plateau. Indeed, an earlier instance of this database compiled with n = 835 samples revealed 64 OCA2-A, 37 OCA2-B and 18 OCA2-C haplotypes, a level of haplotype complexity similar to that observed in the current sample of 1,072 (70 OCA2-A, 38 OCA2-B and 19 OCA2-C haplotypes), indicating that with 1,072 samples we are already well along the plateau phase.
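The plateau argument can be checked with simple arithmetic on the counts quoted above. The short Python sketch below compares the relative growth in sample size with the relative growth in observed haplotypes, and recomputes the fraction of samples with a p-Selected diplotype match. The variable and function names are illustrative only and are not part of the original analysis.

```python
# Counts quoted in the text for the earlier and current database instances.
earlier = {"samples": 835, "OCA2-A": 64, "OCA2-B": 37, "OCA2-C": 18}
current = {"samples": 1072, "OCA2-A": 70, "OCA2-B": 38, "OCA2-C": 19}


def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Growth of the sample count and of each haplotype count.
growth = {key: percent_increase(earlier[key], current[key]) for key in earlier}

# Fraction of the current sample with a p-Selected diplotype match (82 of 1,072).
classified_percent = 100.0 * 82 / 1072

for key, value in growth.items():
    print(f"{key}: +{value:.1f}%")
print(f"classified: {classified_percent:.1f}%")
```

Run as written, the sample count grows by roughly 28% while the haplotype counts grow by roughly 3 to 9%, and the matched fraction is just under 8%, which is consistent with the statement that haplotype discovery is slowing relative to sampling.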
Thus, rather than merely increase the complexity of the classification system, the addition of samples to our database from this point in its development onward is expected to increase the number and population of shared OCA2 diplotypes, enabling us to classify a larger fraction of new samples. Notwithstanding, there are some who believe that any calculation of classification accuracy must involve all samples, not just those one feels comfortable or able to classify. With this criterion, the classification of all but the simplest of phenotypes from DNA is likely to wait many years until the cost per genotype comes down and research groups can afford to build databases tens of thousands of samples strong. However, we have ample evidence that focusing technology on subsamples is not only acceptable in a theoretical sense but highly effective in a practical sense as well. For example, the drug Herceptin® is used to treat only a small fraction of breast cancer patients (those with HER2-positive cancers) because it performs well in that subpopulation. Had Herceptin® efficacy calculations been mandated in the clinical trial to include all breast cancer patients, the drug would likely not have been approved and papers on its effectiveness summarily rejected.
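The database-matching idea discussed in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the `min_matches` parameter, the haplotype strings and the color scores are all hypothetical, chosen only to show that a query is classified solely when its exact diplotype is already present in the database, and that no rule is generalized from population averages.

```python
from collections import defaultdict


def make_diplotype(hap_a, hap_b):
    """Encode an unordered pair of phased haplotypes as a hashable key."""
    return tuple(sorted((hap_a, hap_b)))


def build_database(samples):
    """Map each diplotype to the list of color scores observed for it."""
    database = defaultdict(list)
    for diplotype, color_score in samples:
        database[diplotype].append(color_score)
    return database


def predict_score_range(database, diplotype, min_matches=1):
    """Return (lowest, highest) score among matching entries, or None.

    None means the query diplotype is not present often enough to attempt
    a classification; no inference is made from non-matching entries.
    """
    scores = database.get(diplotype, [])
    if len(scores) < min_matches:
        return None
    return min(scores), max(scores)


# Hypothetical toy data: invented haplotype strings and arbitrary scores.
toy_samples = [
    (make_diplotype("ACG", "ATG"), 1.8),
    (make_diplotype("ATG", "ACG"), 2.1),  # same diplotype, haplotypes swapped
    (make_diplotype("GCG", "GCG"), 3.4),
]
toy_database = build_database(toy_samples)

matched = predict_score_range(toy_database, make_diplotype("ACG", "ATG"))
unmatched = predict_score_range(toy_database, make_diplotype("ACG", "GCG"))
print(matched, unmatched)
```

In this toy setting the first query returns the range of scores seen for its diplotype, while the second returns `None` because that diplotype has not been observed. This mirrors the trade-off described above: matching can be accurate for the samples it covers, but coverage depends on database size.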
OCA2 phenotype complexity and demographic history
The difficulty in predicting the iris color of offspring from that of the parents is the primary evidence that the inheritance of iris color is complex. The continuous distribution of melanin content we observed is additional evidence of the complexity of the phenotype and belies the commonly held notion that iris colors are discrete. The bimodal nature of the distribution could be due to dominance but it may also be due to positive assortative mating. In support of this idea, alleles for about one-third of the SNPs we described (10/33) were found not to be in Hardy–Weinberg equilibrium (HWE), and our results suggested this was unlikely to be due to genotyping error or ambiguity. Assortative mating is a primary reason why alleles for a polymorphism may not be in HWE (Cavalli-Sforza and Bodmer 1999), and it is not hard to imagine this force at work shaping the allelic distribution of a gene involved in iris pigmentation. If so, we might expect to find that iris colors are unequally apportioned among various European sub-populations. Indeed, studies using Eurasian Ancestry Informative Markers have demonstrated an association between sub-European population structure and iris color along a northwestern to southeastern axis (Frudakis 2007). That is, lighter irides are more commonly found in Northwestern and Continental Europe whereas darker irides are more commonly found among individuals of genetic ancestry common in the Middle East and South Asia (regardless of the nations from which their recent ancestors derived; Frudakis 2007). There is evidence for positive selection at the OCA2 locus within Europeans and Asians but not Africans (Voight et al. 2006; Lao et al. 2007), and lighter pigmentation phenotypes are generally recognized to represent a derived state from a more pigmented ancestral state (Lao et al. 2007). Thus, we might expect the overrepresented genotype for each of our 11 non-HWE SNPs to be associated with blue, rather than brown iris colors.
However, we were surprised to note that for 64% of non-HWE SNPs (7/11), it was the genotype associated with darker iris colors that was overrepresented. This could be the result of differential OCA2 exchange rates; the integration of individuals with lighter irides into populations with darker irides in Southeastern Europe, the Middle East and South Asia may have been historically less common than the reverse. Alternatively, or perhaps in addition, the disequilibrium may be the result of higher fertility rates among European and Eurasian populations with darker iris colors, reminiscent of the fertility disparity among native and immigrant populations within continental Europe today.
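For readers less familiar with how departures from Hardy–Weinberg proportions are detected, the following Python sketch shows one common approach, a chi-square goodness-of-fit test with one degree of freedom for a biallelic SNP. This section does not state which test was applied to the SNPs discussed above, so the choice of test is an assumption, and the genotype counts are invented for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Takes the observed counts of the three genotypes and returns
    (chi_square, p_value) for a test with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    if p in (0.0, 1.0):
        raise ValueError("monomorphic SNP: the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: a deficit of heterozygotes, as positive assortative
# mating would tend to produce.
chi_square, p_value = hwe_chi_square(40, 20, 40)
print(f"chi-square = {chi_square:.2f}, P = {p_value:.2g}")
```

A heterozygote deficit of this kind yields a large chi-square statistic, whereas counts in exact Hardy–Weinberg proportions (for example 25, 50 and 25) yield a statistic of zero. For small samples or rare alleles an exact test is generally preferred over the chi-square approximation.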
It is interesting that the most informative diplotypes comprised SNP sets that had to be empirically determined (the p-Selected sets) and were not necessarily related to one another by distance or LD within the locus. It is also interesting that diplotypes comprised of the hap-tag SNPs were not as predictive as diplotypes covering all 33 of our associated SNPs, whether ordered along the chromosome (such as our contiguous sets) or not (such as the p-Selected sets). The fact that most of the associations we described were independent from those of the hap-tagged SNPs, that large SNP sets were required to “explain” iris color in a classification sense, and the large number of SNPs at the OCA2 locus in general together seem to suggest that over the past 50,000 years, as populations expanded out of Africa and the Middle East into Europe and Eurasia, a high degree of functional allelic complexity has been maintained, perhaps as a result of dense population structure amalgamation within Eurasia. The large number of phenotypically relevant and/or associated OCA2 haplotypes seems to confirm that attempting to reduce even the crude elements of a complex phenotype (e.g. apparently multifactorial and/or of complex inheritance rules) to hap-tagged SNPs is not always the best approach if the goal is to predict phenotype from DNA sequences, since doing so may ignore a significant portion of locus diversity. Indeed, while this manuscript was in review, Duffy et al. (2007) described a screen of 58 exonic and tagging OCA2 SNPs, finding three that explained most of the variation in visually assessed iris colors, though these authors did not attempt to classify unknown irides based on the corresponding genotypes. Two of these three SNPs are among the set we describe here (OCA2-D-10 and OCA2-D-11), but while Duffy et al. scanned the entire OCA2 locus, they probably did not identify most of those we describe here because they focused on a smaller number of tagging and exonic SNPs (n = 58).
In contrast, our work was a shotgun-style association scan and imposed no assumptions about the relationship between gene location and functional relevance (indeed, most of the SNPs we describe are located within introns). However, while there may be evidence that the history of iris color evolution is rich with detail and complexity, there is also evidence that the complexity of lighter colored irides pales in comparison to that for darker colored irides. For example, of the 82 p-Selected OCA2 diplotypes shared by more than one sample (i.e. the most common diplotypes), 62 (76%) specified a range of color scores that fell predominantly in the lighter end of the spectrum. This is a by-product of greater diversity of darker multilocus genotypes compared to lighter genotypes in our database. That is, there are fewer “blue” OCA2 sequences, and individuals with blue irides therefore tend to have the same genotypes more often than individuals with brown irides. This observation fits with our expectations, based on our current understanding of the recent origins of the world’s populations, and with the observation that lighter iris color haplotypes represent fairly recent, derived sequences of older, darker haplotypes. Further work on the phylogeography of the haplotypes we have described could therefore help illuminate aspects of human expansions and migrations out of Africa that have heretofore been occluded by admixture and reverse gene flow.
We thank Shannon Boyd and Sara Barrow for assistance with genotyping and Mark Shriver and Marc Bauchet of the Pennsylvania State University for assistance in collecting samples. Our work was supported with private funds.
- Bernstein F (1931) Die Geographische verteilung der blutgruppen und ihre anthropologische bedeutung. In: Comitato Italiano per lo Studio dei Problemi Della Populazione. Roma Instituto Poligrafico dello stato. pp 227–243
- Boissy R, Zhao H, Og W, Austin L, Wildenberg S, Boissy Y, Zhao Y, Sturm R, Hearing V, King R, Nordlund J (1996) Mutation in and lack of expression of tyrosinase-related protein-1 (TRP-1) in melanocytes from an individual with brown oculocutaneous albinism: a new subtype of albinism classified as ‘OCA3.’ Am J Hum Genet 58:1145–1156
- Cavalli-Sforza L, Bodmer W (1999) The genetics of human populations. Dover, Mineola pp 45–59
- Choudhry S, Coyle N, Tang H, Salari K, Lind D, Clark S, Tsai H, Naqvi M, Phong A, Ung N, Matallana H, Avila P, Casal J, Torres A, Nazario S, Castro R, Battle N, Perez-Stable E, Kwok P, Sheppard D, Shriver M, Rodriguez-Cintron W, Risch N, Ziv E, Buchard E (2006) Population stratification confounds genetic association studies among Latinos. Hum Genet 118:652–664
- Devilly G (2005) The odds ratio generator for windows: version 1.0 (computer programme). Centre for Neuropsychology, Swinburne University, Australia
- Frudakis T (2005) Powerful but requiring caution: genetic tests of ancestral origins. Nat General Soc Quart 93:260–268
- Frudakis T (2007) Molecular Photofitting: the inference of phenotype from DNA. Elsevier/Academic, New York (in press)
- Halder I, Shriver M, Thomas M, Fernandez J, Frudakis T (2006) A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Genet (In review)
- Lao O, de Gruijter J, van Dujin K, Navarro A, Kayser M (2007) Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. Ann Hum Genet 71:354–369
- Meinhold-Heerlein I, Bauerschlag D, Hilpert F, Dimitrov P, Sapinoso L, Orlowska-Volk M, Bauknecht T, Park T, Jonat W, Jacobsen A, Sehouli J, Luttges J, Drajewski M, Krajewski S, Reed J, Arnold N, Hampton G (2005) Molecular and prognostic distinction between serous ovarian carcinomas of varying grade and malignant potential. Oncogene 34:1053–1065
- Molokhia M, Hoggart C, Patrick A, Shriver M, Parra E, Ye J, Silman A, McKeigue P (2003) Relation of risk of systemic lupus erythematosus to west African admixture in a Caribbean population. Hum Genet 112:301–308
- Schneider S, Roessli D, Excoffier L (2000) Arlequin ver. 2.000: a software for population genetics data analysis. Genetics and Biometry Laboratory, University of Geneva, Switzerland
- Shriver M, Parra E, Dios S, Bonilla C, Norton H, Jovel C, Pfaff C, Jones C, Massac A, Cameron N, Baron A, Jackson T, Argyropoulos G, Jin L, Hoggart C, McKeigue P, Kittles R (2003) Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet 112:388–399
- Shriver M, Mei R, Parra E, Sonpar V, Halder I, Tishkoff S, Schurr T, Zhadanov S, Osipova L, Brutsaert T, Friedlaender J, Jorde L, Watkins W, Bamshad M, Gutierrez G, Loi H, Matsuzaki H, Kittles R, Argyropoulos G, Fernandez J, Akey J, Jones K (2005) Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genom 2:81–89
- Smith M, Patterson N, Lautenberger J, Truelove A, McDonald G, Waliszewska A, Kessing B, Malasky M, Scafe C, Le E, De Jager P, Mignault A, Yi Z, The G, Essex M, Sankale J, Moore J, Poku K, Phair J, Goedert J, Vlahov D, Williams S, Tishkoff S, Winkler C, De La Vega F, Woodage T, Sninsky J, Hafler D, Altshuler D, Gilvert D, O’Brien S, Reich D (2004) A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74:1001–1013
- Yang N, Li H, Criswell L, Gregersen P, Alarcon-Riquelme M, Kittles R, Shigata R, Silva G, Patel P, Belmont J, Seldin M (2005) Examination of ancestry and ethnic affiliation using highly informative diallelic DNA markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet 118:382–392