Human Genetics, 122:311

Multilocus OCA2 genotypes specify human iris colors

  • Tony Frudakis
  • Timothy Terravainen
  • Matthew Thomas
Original Investigation

Abstract

Human iris color is a quantitative, multifactorial phenotype that exhibits quasi-Mendelian inheritance. Recent studies have shown that OCA2 polymorphism underlies most of the natural variability in human iris pigmentation but to date, only a few associated polymorphisms in this gene have been described. Herein, we describe an iris color score (C) for quantifying iris melanin content in-silico and undertake a more detailed survey of the OCA2 locus (n = 271 SNPs). In 1,317 subjects, we confirmed six previously described associations and identified another 27 strongly associated with C that were not explained by continental population stratification (OR 1.5–17.9, P = 0.03 to <0.001). Haplotype analysis with respect to these 33 SNPs revealed six haplotype blocks and 11 hap-tags within these blocks. To identify genetic features for best-predicting iris color, we selected sets of SNPs by parsing P values among possible combinations and identified four discontinuous and non-overlapping sets across the LD blocks (p-Selected SNP sets). In a second, partially overlapping sample of 1,072, samples with matching diplotypes comprised of these p-Selected OCA2 SNPs exhibited a rate of C concordance of 96.3% (n = 82), which was significantly greater than that obtained from randomly selected samples (62.6%, n = 246, P<0.0001). In contrast, the rate of C concordance using diplotypes comprised of the 11 identified hap-tags was only 83.7%, and that obtained using diplotypes comprised of all 33 SNPs organized as contiguous sets along the locus (defined by the LD block structure) was only 93.3%. These results confirm that OCA2 is the major human iris color gene and suggest that using an empirical database-driven system, genotypes from a modest number of SNPs within this gene can be used to accurately predict iris melanin content from DNA.

Introduction

Variable iris pigmentation in humans is the result of variable distribution of pigment granules among a fixed number of stromal melanosomes (Imesch et al. 1997). The pigment, melanin, is a light-absorbing biopolymer synthesized and stored in the melanosomes of melanocytes. Different iris colors and patterns are probably a function of melanin synthesis levels and developmentally programmed melanocyte distribution patterns. Studies of oculocutaneous mutants (both man and model systems) show that complete demelanization is associated with a pink or red iris color due to the passage of white light through the cornea and depigmented iris, and reflection from internal capillaries on the retina. In normal subjects, specific wavelengths of the white light are scattered to the surface as a function of the composition of the melanosomes as well as their packaging or distribution (Prota et al. 1998; Sturm and Frudakis 2004). Uniform distributions of melanin-rich melanosomes give rise to darker iris colors due to high absorption and little reflection of light to the surface. Regions of thinly distributed melanosomes with less melanin give rise to lighter colors due to less absorption and more reflectance of blue and green wavelengths to the surface. Melanin in the iridial stroma is synthesized from tyrosine and takes two forms: eumelanin (brown pigment) and, when cysteine is present, pheomelanin (red-yellow pigment). Various iris colors are associated with various ratios of the two: green irides are associated with pheomelanin, brown irides with eumelanin, mixed color eyes with a mix of the two and blue irides with very low levels of either. The distribution of colors is complex: some greens are brighter than others, some browns show a red tint while others do not, and some irides of blue color appear as dark irides from a distance due to a darker hue. The patterning of iris colors from individual to individual is equally complex. Blue irides often have peripupillary rings or sectors of brown, red or yellow, and many irides that appear to be of a light color from a distance are revealed upon closer inspection as a mix of light and dark regions.

The genetics of iris color has long interested teachers and basic biologists, but has been largely ignored by research geneticists. Part of the reason for this is the difficulty of measuring the melanin content of the iris. Though skin and hair pigmentation phenotypes are notoriously complex (Akey et al. 2001; Bito et al. 1997; Sturm et al. 2001; Box et al. 1997, 2001), the inheritance of iris color is quasi-Mendelian. The iris color of offspring can usually be predicted given those of the parents, but there are numerous examples of parents with blue irides producing offspring with brown or partially brown irides (Sturm et al. 2001; Brauer and Chopra 1978). This inheritance pattern seems to suggest a small number of genes at work, with one of these probably explaining most of the variation. Indeed, early pedigree studies suggested that natural variation in iris color is a function of variation at two loci: a major locus responsible for pigmentation of the iris but not the skin or hair, and another pleiotropic gene controlling pigmentation levels in all tissues simultaneously (Brues 1975). Study of oculocutaneous albinism in both man and model organisms since has revealed that a small family of pigmentation genes is required for normal pigmentation of the iris, most known or thought to be directly or indirectly involved in the biosynthesis of eumelanin from tyrosine (Sturm et al. 2001; Durham-Pierre et al. 1994, 1996; Gardner et al. 1992; Hamabe et al. 1991; Chintamaneni et al. 1991; Abbott et al. 1991; Boissy et al. 1996; Robbins et al. 1993; Smith et al. 1998; Flanagan et al. 2000; Ooi et al. 1997). However, until recently little work had been directed toward identifying the determinants of natural (non-disease based) variation of iris colors.
Publication of the human genome map and the proliferation of genome screening tools enabled the first of these screens just recently, and a few studies have implicated one human pigmentation gene, OCA2, as the primary determinant of naturally occurring iris colors in humans (Sturm and Frudakis 2004). With a linkage scan, Eiberg and Mohr (1996) localized a brown-iris locus to an interval containing the OCA2 and MYO5A genes on chromosome 15. Zhu et al. (2004) used sib-pairs to identify a massive LOD peak (LOD = 19.2) for a chromosomal interval containing the OCA2 and MYO5A loci but no significant peaks on other chromosomes, and these authors concluded that a gene within this interval (most likely OCA2) explained about 75% of iris color variation. This result was later reproduced by others who obtained a LOD = 2.9 over this region (Posthuma et al. 2006). Frudakis et al. (2003) showed that of 16 pigmentation genes screened, by far the most numerous and consistently strong SNP allele associations with respect to iris colors and shades measured in different ways were found at OCA2 (in contrast to those from MYO5A), suggesting that the aforementioned linkage results were due to the influence of the OCA2 gene. As this paper was being written, other authors demonstrated that diplotypes from three OCA2 SNPs are sufficient to explain most of the variation in human iris color (Duffy et al. 2007).

OCA2 is the human homologue of the mouse pink-eyed (p) dilute mutation (Sturm et al. 2001; Rinchik et al. 1993; Brilliant 2001; Oetting et al. 1998) and the OCA2 protein product is distributed in the melanosomal membrane. Though the function of OCA2 has not yet been determined, it shares structural similarity with ion transporters (e.g. the E. coli Na+/H+ antiporter), possibly indicating that it is involved in pH regulation (Puri et al. 2000). pH is thought to be important for the activity of tyrosinase (TYR), which is integral to both the eumelanin and pheomelanin biosynthesis pathways. It has been suggested that because p mice show perturbed eumelanin with normal pheomelanin production, OCA2 is involved only in eumelanin anabolism, but this conclusion may be simplistic. Because TYR catalyzes two separate biochemical reactions in the eumelanin pathway and only one in the pheomelanin pathway, it may be that certain OCA2 mutations affect the two related pathways disproportionately. The OCA2 gene is characterized by remarkable locus diversity: over 1,129 OCA2 SNPs are known to exist. The purpose of the work reported herein was to more fully characterize the association between OCA2 polymorphism and iris pigmentation. From a survey of 271 OCA2 SNPs, using a more objective phenotyping system, we confirmed the top six OCA2 associations we had previously reported (Frudakis et al. 2003) and identified an additional 27 associated SNPs spanning the OCA2 locus. We show that when alleles of these SNPs are assembled into multilocus haplotypes (diplotypes), the iris color scores among samples with matching genotypes are over 96% concordant. The level of concordance for new samples not used in the screening step was similar to that for samples used to identify the SNP associations. Our results suggest that OCA2 diplotypes can be used for predicting iris color from DNA with good accuracy.

Methods

Samples

Samples were collected from volunteers at scientific conferences and students as well as local residents using informed consent and a basic biographical questionnaire for the self-reporting of ethnicity. Our survey area extended throughout the continental US. Cryptic population structure differences among subjects of different phenotype can contribute towards type I error, which has been suspected as a problem with SNP-based association studies in the past (Hoggart et al. 2003; Terwilliger and Goring 2000). We attempted to minimize the contribution of cryptic population stratification by using samples from individuals that self-reported as “Caucasian”, or similar by screening against the use of terms “African”, “African American”, “Asian” or “Indigenous American”. Then, as have others before us working on a variety of diseases (Fernandez et al. 2003; Molokhia et al. 2003; McKeigue 1997; Zhu et al. 2005; Reiner et al. 2005) and other phenotypes (Parra et al. 2004; Shriver et al. 2003; Smith et al. 2004; Yang et al. 2005; Bonilla et al. 2004), we estimated individual genomic ancestry admixture for each sample. These estimates serve two purposes—eliminating samples from the analysis with high non-European admixture eliminates elements of potentially confounding structure (particularly since iris colors are unequally distributed among world populations), and after identifying associated OCA2 SNPs, we can use the estimates to distinguish those that are in linkage disequilibrium (LD) with phenotypically active loci from those that are merely markers of elements of residual population structure (i.e. lower levels of admixture) correlated with phenotype (Rosenberg et al. 2005; Hoggart et al. 2003; Choudhry et al. 2006; Parra et al. 1998; McKeigue et al. 2000; Pfaff et al. 2001). 
Maximum likelihood estimates of individual biogeographical ancestry (BGA) admixture were determined using methods described previously by others (Bernstein 1931; Chakraborty 1986) and 176 ancestry informative markers (AIMs) (Halder et al. 2006; Frudakis et al. 2003; Frudakis 2005; Shriver and Kittles 2004) chosen from the genome based on their information for a continental, 4-population model. For describing elements of this model, we found the terms “European” (genetic ancestry shared among Europeans, Middle Easterners and to a lower extent South Asians), “sub-Saharan African”, “East Asian” and “Indigenous American” (genetic ancestry shared among American Indians, Latin and South American Indians and certain Central Asian populations) to be convenient. The choice of a 4-population model was based on previously published hypothesis-free cluster analyses of world-wide populations (Rosenberg et al. 2002; Shriver et al. 2005), and the nomenclature of the parental populations is reflective of modern population distributions, not necessarily the distributions of the original parental populations that existed 15,000–50,000 years ago.

Phenotyping

It is suspected by many that results from association scans have been difficult to replicate due in part to imprecise definition of phenotypes, resulting in the measurement of numerous covariates simultaneously (Terwilliger and Goring, 2000). We therefore invested substantial effort into phenotyping our subjects in a manner best suited for testing our hypothesis that variable OCA2-mediated melanin production was the primary determinant of human iris colors. We used digital photography and in-silico spectrophotometry to quantify the eumelanin content of each iris irrespective of its iridial patterning in terms of an Iris Color Score (“C”, see Supplementary Materials). The C value combines the luminance, red, green and blue color reflectance values from four quadrants of each iris into a single variable. Darker irides with greater eumelanin content are characterized by lower C values and lighter irides with less eumelanin are characterized by higher C values.

Genotyping

Genotyping was carried out using single base primer extension and a Beckman SNPstream instrument (Beckman-Coulter, Fullerton, CA, USA). Amplification primers and multiplexes thereof were designed using Beckman’s Autoprimer software program (http://www.autoprimer.com). Reaction conditions followed those recommended for the SNPstream, except that Hot Master taq was used in all PCR reactions.

Statistical methods

Allele frequencies were derived from counts of genotype i using the function $p_i = x_i/(2n)$, where $x_i$ is the number of times that the genotype or haplotype i was observed and n is the number of subjects in the group. Haplotypes were inferred using the PHASE program (Stephens et al. 2001; Stephens and Donnelly 2003). Allele and genotype associations were analyzed using a 2 × 2 contingency table with a two-tailed Fisher’s Exact Test (GraphPad Software Inc.) or using multiple regression with MedCalc for Windows (statistics for biomedical research software, version 9), as indicated. Regression of admixture on iris color scores was also carried out using MedCalc software. Multiple regression coefficients of determination were calculated from models incorporating the indicated sets of SNPs as independent variables, except for the analysis of all 33 SNPs simultaneously. Due to the large number of variables, the coefficient for all 33 SNPs was estimated by adding those for the contiguous SNP sets defined by the LD block structure, as described in the text. Power calculations were performed using software provided by Purcell and colleagues (Purcell et al. 2003) (Genetic Power Calculator) with observed values for allele frequencies, phenotype prevalence (0.5 for either light or dark irides), genotype relative risk (for the associated phenotype) and D’ (1.0, since the OCA2 gene is likely to be the main iris color pigmentation gene). Relative risk and odds ratios with their 95% confidence intervals were calculated using the Odds Ratio Generator for Windows (Devilly 2005). LD and Hardy–Weinberg Equilibria were calculated using HAPLOVIEW software (Barrett et al. 2005) and confirmed using Arlequin software (Schneider et al. 2000). For LD calculations we used 3,000 permutations and 10 initial conditions and for HWE calculations we used 100,000 steps in the Markov chain and 1,000 dememorisation steps.
LD and haplotype block analysis for the identification of hap-tag SNPs was accomplished using the program HAPLOVIEW (Barrett et al. 2005).
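
As a rough illustration of two of the calculations described in this section, the following Python sketch computes a frequency of the form p_i = x_i/(2n) and a two-tailed Fisher's exact P value for a 2 × 2 table. It is not the GraphPad implementation used in the study; the function names and the example counts are hypothetical, and the two-tailed rule used here (summing all tables no more probable than the observed one) is an assumption about the convention.

```python
from math import comb


def allele_frequency(count, n_subjects):
    """Frequency p_i = x_i / (2n) for x_i observations among n diploid subjects."""
    return count / (2 * n_subjects)


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            total += p
    return total


# Hypothetical counts: allele copies in light-iris vs dark-iris groups.
print(allele_frequency(30, 100))
print(fisher_exact_two_tailed(3, 1, 1, 3))
```

For large samples a library routine such as `scipy.stats.fisher_exact` would normally be preferred over this pure-Python version.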

Results

Iris color phenotyping

Work attempting to explain the genetic variation of iris color phenotypes conducted to date by us and others (Eiberg and Mohr 1996; Zhu et al. 2004; Frudakis et al. 2003) has used subjective iris color classifications (e.g. blue, hazel, brown). We attempted to objectify the quantification of iris color using digital spectroscopy. For each sample, an iris color score (C) was determined using a function incorporating the basic elements of the C.I.E. color wheel (luminosity, red reflectance, blue reflectance, green reflectance, see Supplementary Material). Iris color scores ranged from 0.9 to 2.9 (average = 2.07, Table 1). The distribution of iris color scores was continuous, and appeared bi-modal with one peak around 1.85 (which most observers would consider a “brown” iris) and another around 2.65 (which most would consider a “blue” iris) (Fig. 1).
Table 1

Iris color parameter statistics (n = 1,072)

           Luminosity   Red     Green   Blue    C(a)
Average    149.9        168.9   147.1   113.0   2.1
Min        47.3         64.5    42.1    22.3    0.9
Max        228.0        239.6   231.5   213.3   2.9
Range      180.7        175.1   189.3   190.9   2.0
Stdev      32.4         27.0    36.8    41.8    0.4

(a) Iris color score

Fig. 1

The continuous iris color score (C) distribution is bi-modal in a sample of 1,756 subjects. Plotted is the number of samples for which the C value falls within the indicated range. Two maxima are observed: one in the range of C values normally encountered for “brown” eyes and one in the range normally encountered for “blue” eyes, but a significant fraction of the sample is characterized by lighter (C > 2.6), darker (C < 1.9) or intermediate (1.9 < C < 2.6) values

OCA2 screening

OCA2 has already been implicated as the locus underlying most variation in human iris colors. Our aim here was to more comprehensively mine the OCA2 locus for iris color information, and identify a complement of markers that could be used to predict iris color from DNA. A preferred approach would rely on hap-tags from the HapMap project, but at the time this work commenced, hap-tags for the OCA2 locus were not yet available. Thus, we chose to apply a “shotgun approach” for identifying OCA2 SNPs associated with iris color; that is, picking SNPs randomly to cover the locus evenly with respect to map position. We then mapped the haplotype and phenotype structure of the locus with respect to the associated SNPs and identified the most parsimonious complement fully predictive for iris color.

We first examined 147 biallelic OCA2 SNPs randomly selected from the NCBI:dbSNP database, each with a validated population allele frequency. We focused on higher-frequency SNPs (minor allele frequency > 5%) rather than on coding SNPs, and the selected 147 candidates covered the length of the locus. We broke a sample of 1,317 subjects into two eye color score groups: those with color scores above the mean value of 2.07 (including blue, green, blues and greens with small brown rings/sectors/flecks, etc.) and those below (the darker irides; hazels, blues with thick brown rings, solid browns, etc.). We then used Fisher’s exact test to identify alleles with unequal distribution among these two groups, considering association either at the level of the allele or at the level of the genotype. The most strongly associated SNPs (marginally associated, n = 22) are presented in Table S1a (OCA2-A-1 through OCA2-C-6; Supplementary Material). The odds that a particular genotype would be found in an individual of a color type with which the genotype was associated (color type and genotype indicated in column 4, Table S1a) ranged from 1.5 to 18 and for most loci, the ratio was significantly above 1.0 (Assoc’d Genotype Odds Ratio, column 8, Table S1a). Regression analysis showed the associations with quantitative iris color scores were stronger than those obtained from contingency analysis based on discrete iris color bins (column 7, Table S1a). Since the SNPs are located within the same gene, which we already know is linked with human iris color, a correction for multiple tests was not appropriate. Six of the SNPs were previously described as associated with iris color (Frudakis et al. 2003, column 6, Table S1a) and their P values were of similar strength as previously reported (column 5, Table S1a). Due to the relatively high minor allele frequencies, our study was well powered to detect these strong associations; for most we had a power well over 90% (column 9, Table S1a). 
Among the 1,317 samples, we identified 88 multilocus SNP genotypes present for more than one sample. For 48 of these, the samples possessed concordant iris colors (range of C across all samples with the genotype ≤1.0) and for the remaining 40, the samples exhibited iris color score discordance (range of C across all samples with the genotype >1.0). Given our hypothesis that common OCA2 SNPs explained most of the crude aspects of iris color variation (e.g. melanin content), this result suggested there remained other SNPs associated with iris color yet to be identified.
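
The concordance criterion just described (a range of C of at most 1.0 across all samples sharing a multilocus genotype) can be sketched in Python as follows. This is only an illustration of the stated rule; the function name, the genotype strings and the C values are hypothetical and are not taken from the study data.

```python
from collections import defaultdict


def classify_genotype_groups(samples, max_range=1.0):
    """Label each shared multilocus genotype as concordant or discordant.

    samples: iterable of (genotype_key, iris_color_score) pairs.
    Only genotypes observed in more than one sample are classified.
    """
    groups = defaultdict(list)
    for genotype, score in samples:
        groups[genotype].append(score)

    labels = {}
    for genotype, scores in groups.items():
        if len(scores) < 2:
            continue  # a genotype seen once cannot be assessed
        spread = max(scores) - min(scores)
        labels[genotype] = "concordant" if spread <= max_range else "discordant"
    return labels


# Hypothetical example data.
example = [
    ("AA/GG/CT", 2.6), ("AA/GG/CT", 2.8),   # range 0.2
    ("AG/GT/CC", 1.2), ("AG/GT/CC", 2.7),   # range 1.5
    ("GG/TT/CC", 1.9),                      # seen only once
]
print(classify_genotype_groups(example))
```

Groups labelled discordant under this rule correspond to the genotype groups that motivated the second screen described next.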

To identify these SNPs most efficiently, we searched a second set of 124 randomly selected OCA2 SNPs not previously tested in the first screen. We screened these 124 SNPs for conditional associations within the discordant genotype groups just discussed. That is, we were searching for SNPs that helped resolve the iris color score discrepancies we observed within the 40 multilocus genotype groups. An additional set of 11 associated SNPs were found (OCA2-D, Table S1b). Genotypes for each resolved discordant iris colors within at least one confounded multilocus genotype group and genotypes for several resolved discordant iris colors within multiple confounded multilocus genotype groups (Column 8 “Confounder Groups Resolved”, Table S1b). Alleles for six of these 11 were marginally (independently) associated with iris color scores as quantitative variables as well as binned iris color scores (above and below C = 2.2), but alleles for five were only associated with the former (Table S1b). This hierarchical screening approach allowed us to first discover numerous SNPs associated with iris color scores from our first screen, identify the gaps in iris color information content for this set (that would be required for accurate inference of iris color score), and then specifically fill those gaps from SNPs selected in a secondary screen.

The addition of the 11 OCA2-D SNPs gave us a total of 33 SNPs spanning the OCA2 locus from 15.05 to 16.36 cM along chromosome 15. Most were located within introns, though one (OCA2-C-1) was non-synonymous, two were synonymous and three were located in the 3′UTR (column 2, Tables S1a and S1b). Regressed on iris color scores, the 33 associated SNP alleles showed a multiple correlation coefficient (R) of 0.637 and a coefficient of determination (R2) of 0.406. Linkage disequilibrium (LD) analysis of the 33 associated SNPs revealed that many were in LD with one another (Fig. 2). Six distinct haplotype blocks were identified as accounting for most of the diversity of the OCA2 locus represented with the 33 SNPs (Fig. 2) and within these blocks, 11 of the SNPs were selected by the HAPLOVIEW program as hap-tag SNPs, necessary for capturing the diversity represented by these blocks (SNP 1: OCA2-B-1, SNP 2: OCA2-D-1, SNP 5: OCA2-C-2, SNP 6: OCA2-C-3, SNP 7: OCA2-D-3, SNP 9: OCA2-C-4, SNP 18: OCA2-A-4, SNP 20: OCA2-A-7, SNP 21: OCA2-C-1, SNP 22: OCA2-A-8, SNP 24: OCA2-C-6). About half of the SNPs were not easily allocated to any one particular haplotype block and alleles for none of the 33 SNPs were in complete LD with another (i.e. recombinants were found for all pairs). To test whether the associations for the non-hap-tag SNPs were independent from those of the hap-tag SNPs, we used regression models. With iris color score as the dependent variable, we tested each non-hap-tag SNP, one at a time, with the set of 11 hap-tag SNPs as the independent variables. This analysis showed that alleles for many of the non-hap-tag SNPs (8/22) were associated with iris color independently from the hap-tag SNPs, indicating that they provided additional iris color information (Table 2b). The six SNPs that we had previously described as associated with iris color (Frudakis et al. 2003) fall within LD blocks 4 and 5 (OCA2-A-4, OCA2-A-5, OCA2-A-6, OCA2-A-7, OCA2-A-8 and OCA2-A-9; the SNPs left untested in Table 2a). 
To assess whether the iris color associations were independent of these, we performed a similar multiple regression analysis incorporating each of these six OCA2 SNPs as independent variables (grey shading, Table 2a), and testing the significance of the iris color association for each remaining SNP, one at a time, as the seventh independent variable. The significance of the association for most of the SNPs (16/27) was independent of these six, indicating that they provided additional information about iris color.
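
For readers unfamiliar with the pairwise LD measure plotted in Fig. 2, the following Python sketch computes r² between two biallelic SNPs from phased haplotypes using the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)). It is not the Haploview implementation; the function name and the toy haplotypes are hypothetical, and alleles are assumed to be coded 0/1.

```python
def pairwise_r2(haplotypes, i, j):
    """r^2 between SNPs i and j from phased haplotypes coded as 0/1 alleles."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denominator


# Toy data: two SNPs in complete LD, then the same SNPs with recombinants.
complete_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]
with_recombinants = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(pairwise_r2(complete_ld, 0, 1))
print(pairwise_r2(with_recombinants, 0, 1))
```

A pair reaches r² = 1 only when no recombinant haplotypes are present, which is the situation the text reports was not observed for any pair of the 33 SNPs.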
Fig. 2

Linkage disequilibrium (LD) structure of OCA2 locus viewed through the lens of the 33 SNPs found to be strongly associated with iris color scores. The figure was created using the Haploview program (Barrett et al. 2005). SNPs are identified by number along the OCA2 locus (bar, top of figure) as well as by the name used in Table 2a and b. Position within the OCA2 locus is indicated with lines. Boxes corresponding to pair-wise R2 = 1 are shown in black, R2 = 0 in white, and intermediate values in successively darker shades of grey the higher the value. There exists extensive LD among many of these 33 SNPs and six LD blocks were defined as indicated with diamond enclosures. However, none of the SNPs was found to be in complete LD with any other (that is, recombinants can be found)

Table 2

Multiple regression analysis shows that most of the 33 associated OCA2 SNPs provide iris color information over and above that from previously published associations (Frudakis et al. 2003) or the hap-tag subset discussed in the text

SNP         (A) Frudakis et al. (2003) set + each SNP   (B) Hap-tags + each SNP
OCA2-A1     0.2434                                      0.5927
OCA2-A2     <0.0001                                     0.1837
OCA2-A3     0.6919                                      0.2967
OCA2-A4     -                                           -
OCA2-A5     -                                           0.2543
OCA2-A6     -                                           0.201
OCA2-A7     -                                           -
OCA2-A8     -                                           -
OCA2-A9     -                                           <0.0001
OCA2-A10    <0.0001                                     <0.0001
OCA2-B1     0.0007                                      -
OCA2-B2     0.0002                                      0.9874
OCA2-B3     0.413                                       0.3912
OCA2-B4     0.0001                                      <0.0001
OCA2-B5     0.0431                                      0.0001
OCA2-B6     0.0289                                      0.0038
OCA2-C1     0.6922                                      -
OCA2-C2     <0.0001                                     -
OCA2-C3     0.0005                                      -
OCA2-C4     0.0001                                      -
OCA2-C5     0.1476                                      0.6727
OCA2-C6     0.1404                                      -
OCA2-D1     0.0324                                      -
OCA2-D2     0.0001                                      0.9526
OCA2-D3     <0.0001                                     -
OCA2-D4     0.7786                                      0.12
OCA2-D5     0.9765                                      0.4639
OCA2-D6     0.0087                                      0.0881
OCA2-D7     0.2482                                      0.6096
OCA2-D8     0.1464                                      0.1138
OCA2-D9     0.0679                                      <0.0001
OCA2-D10    <0.0001                                     <0.0001
OCA2-D11    <0.0001                                     <0.0001

Entries are the significance of the iris color association for each SNP when added to the indicated base set. A dash marks an empty cell: the SNP belongs to the base set for that column and was not tested against it.


In our sample, homozygotes and heterozygotes were observed for each of the 33 SNPs. Though departures from HWE are well-tolerated by PHASE (Stephens et al. 2001), we tested each SNP for HWE and found that alleles for 10 of the SNPs were not in HWE in our predominantly European sample (Supplementary Material, Table S2, those with P < 0.05, column 4). Each of these 10 cases was a function of an overrepresentation of homozygotes, rather than heterozygotes. For three of these, the genotype associated with lighter (higher) color scores was overrepresented (OCA2-B-3, OCA2-C-3, OCA2-D-2) but for the remaining seven, the “darker” genotype was overrepresented. Genotyping error is a potentially trivial source of such an observation, but genotypes for each of these 10 SNPs routinely passed Beckman UHT quality control checks, and the number of ambiguous genotype calls was no greater for this set of SNPs than for those in HWE. With the Beckman UHT, the reliability of a genotype call is inversely proportional to its distance from the assigned genotype cluster (in terms of a two-dimensional plot of allele X and Y signal), which is measured with a “LOD” score. The compactness of the genotype clusters is measured by averaging the “LOD” variables for calls within and across the clusters. When genotyping problems are encountered, lower LOD values are obtained, indicating dispersed clusters that merge into one another. The tightness of the genotype clustering for those SNPs not in HWE (average LOD = 5.29) was not significantly different from that for those that were in HWE (average LOD = 5.16).
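
To make the notion of a homozygote excess concrete, the sketch below tests genotype counts against Hardy–Weinberg proportions with a one-degree-of-freedom chi-square goodness-of-fit test. This is a simpler, swapped-in technique, not the Markov chain exact test run in Arlequin and HAPLOVIEW for the study; the function name and the genotype counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi2, p_value, homozygote_excess). Assumes both alleles are
    present, so that every expected genotype count is positive.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    homozygote_excess = n_ab < expected[1]
    return chi2, p_value, homozygote_excess


# Hypothetical counts: one SNP in HWE, one with too few heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

The third return value flags the direction of the departure, which is the feature the text emphasizes for the SNPs that were not in HWE.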

Classification accuracy

We next investigated whether alleles for the 33 associated SNPs could be used to infer iris color. We did this by assessing concordance of iris colors among samples with the same inferred OCA2 diplotypes among these SNPs. To test the concordance of iris colors for samples of shared OCA2 diplotypes, we studied concordance among 602 of the original set of 1,317 samples for which remaining DNA was available (“Model” or Discovery samples, part of the sample used to select the OCA2 SNPs) as well as within a separately collected sample of 470 (“Validation” samples, not used in the selection of the 33 OCA2 SNPs). Each of these samples was genotyped (or re-genotyped, as the case may be) for all 33 of the OCA2 SNPs, and for several subsets of the 33 SNPs (described fully below) we
  1. inferred haplotypes for all 1,072 samples,
  2. assembled the samples into OCA2 diplotype groups,
  3. calculated the iris color score concordance among samples of shared diplotypes.

If variable melanin content of the iris is a function primarily of OCA2 variation, and if the phase-known alleles of a selected set of associated OCA2 SNP markers are necessary and sufficient to infer iris color, then the database of phenotypes and OCA2 diplotypes with respect to these SNPs constitutes a rudimentary iris color classification system. That is, the iris color of one sample with a given OCA2 diplotype should tell us the iris color of another sample with the same OCA2 diplotype, and a measure of iris color concordance among samples sharing diplotypes is a measure of the classification accuracy for both Model (used to select the SNPs) and Validation samples (not used to select the SNPs). To determine whether the color of an iris 1 was concordant with that of others {2, 3, 4, ..., n} of the same OCA2 diplotype, we calculated the average color score C for {2, 3, 4, ..., n} as the point estimate of sample 1, took a range (R) around this estimate, and determined whether the color score of sample 1 fell within R. We specified a range about the estimate because we are attempting to infer melanin content directly and color indirectly, and irides of the same melanin content likely express slightly different colors, patterns, and consequently, C values. We first focused on the hap-tag OCA2 SNPs from our LD analysis (Fig. 2). Using a range R = 1.0 about the inferred C for each iris (±0.5 around the inferred color score for each), the iris color scores for samples with the same OCA2 hap-tag diplotypes were 83.7% concordant. Put another way, when predicting the iris color for a sample by using the iris color of other samples with matching OCA2 hap-tag diplotypes as a guide, the predictions were correct or within the inferred range 83.7% of the time. We observed a similar value for model and validation samples (82.7 and 86.3%, respectively) (Table 3, “hap-tag” rows). 
Similar but lower rates of iris color score concordance were observed using an R = 0.9 about the point estimate of iris color (Table 3). In multiple regression models, the coefficient of determination for the entire set of 11 hap-tag SNPs was R2 = 0.12.
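
The leave-one-out concordance procedure described above can be sketched in Python as follows: for each sample, the mean C of the other samples sharing its diplotype serves as the point estimate, and the sample counts as concordant when its own C lies within ±R/2 of that estimate. This is a minimal illustration under that reading of the text; the function name, diplotype labels and C values are hypothetical.

```python
from collections import defaultdict


def concordance_rate(samples, total_range=1.0):
    """Leave-one-out concordance of iris color scores within diplotype groups.

    samples: iterable of (diplotype_key, iris_color_score) pairs.
    Returns (fraction_within_range, number_of_matched_samples).
    """
    groups = defaultdict(list)
    for diplotype, score in samples:
        groups[diplotype].append(score)

    half_width = total_range / 2          # R = 1.0 means +/- 0.5
    matched = 0
    hits = 0
    for scores in groups.values():
        if len(scores) < 2:
            continue                      # no matching sample to predict from
        total = sum(scores)
        for score in scores:
            estimate = (total - score) / (len(scores) - 1)   # mean of the others
            matched += 1
            if abs(score - estimate) <= half_width:
                hits += 1
    rate = hits / matched if matched else float("nan")
    return rate, matched


# Hypothetical example data.
example_samples = [
    ("d1", 2.6), ("d1", 2.7), ("d1", 2.8),   # tight group: all within range
    ("d2", 1.0), ("d2", 2.5),                # scores 1.5 apart: both miss
    ("d3", 2.0),                             # unmatched, ignored
]
print(concordance_rate(example_samples, total_range=1.0))
```

Passing `total_range=0.9` applies the stricter second criterion reported in Table 3.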
Table 3

Iris color concordance within multilocus OCA2 genotype groups using two different evaluation criteria

                                                       % within predicted range
Haplotypes         Sample       Matched samples (N)    R = ±1.0(a) (I = 23)(b)   R = ±0.9(a) (I = 20)(b)
Hap-tags           Model        393                    82.70                     76.30
                   Validation   153                    86.30                     73.20
                   Total        546                    83.70                     75.50
Contiguous         Model        23                     91.30                     87.00
                   Validation   7                      100.00                    100.00
                   Total        30                     93.30                     90.00
p-Selected         Model        56                     96.40                     92.90
                   Validation   26                     96.15                     92.30
                   Total        82                     96.30                     92.70
Random set 1       Model        57                     63.20                     54.40
                   Validation   25                     72.00                     68.00
                   Total        82                     65.90                     58.50
Random set 2       Model        57                     70.20                     64.90
                   Validation   25                     64.00                     60.00
                   Total        82                     68.30                     63.40
Random set 3       Model        57                     56.10                     47.40
                   Validation   25                     48.00                     44.00
                   Total        82                     53.70                     46.30
Total random set   Model        171                    63.20                     55.60
                   Validation   75                     61.30                     57.30
                   Total        246                    62.60                     56.10

(a) Iris color score range

(b) Range of iris color score parameters

A second concordance assessment was made using all 33 SNPs ordered into contiguous blocks demarcated by the boundaries of LD structure across the locus. We broke the 33 SNPs into three sets (which we will call “contiguous” blocks): the first covered haplotype blocks 1, 2 and 3 from the analysis of Fig. 2 (SNP 1: OCA2-B-1 contiguously through SNP 12: OCA2-B-3, multiple regression R² = 0.0865 with iris color scores), the second covered haplotype blocks 4, 5 and 6 from the analysis of Fig. 2 (SNP 13: OCA2-D-5 contiguously through SNP 25: OCA2-C-5, multiple regression R² = 0.0848 with iris color scores), and the third set covered the remaining SNPs from SNP 26: OCA2-B-5 through SNP 33: OCA2-D-11. The diplotypes for SNPs organized this way revealed a multiple regression R² = 0.2342 with iris color scores. Using an R = 1.0 about the point estimate of C for each iris, the iris color scores for samples with the same OCA2 diplotypes across all three of the contiguous blocks were 93.3% concordant. Values for Model and Validation samples were similar, given the small number of samples available with matches (91.3 and 100.0%, respectively) (Table 3, “contiguous” rows). Similar but lower rates of iris color score concordance were observed using a range R = 0.9 of iris color scores (Table 3).

The final assessment involved partitions of the 33 associated SNPs into blocks that were empirically determined to be of the highest predictive value for iris color. Among the 22 SNPs identified from the first screen, we assembled each possible (2, 3, ..., 11) SNP combination, inferred diplotypes with respect to each, and determined the Fisher’s P value for association with binned iris color scores. Repeating this process for each possible set of combinations, we determined that haplotypes for the sets defined by the groupings in Table S1a (OCA2-A: 11 SNPs, OCA2-B: 6 SNPs, OCA2-C: 6 SNPs) provided the lowest combined Fisher’s P values. To these three sets of SNPs we added a fourth set comprised of those in Table S1b (OCA2-D: 11 SNPs), which were originally selected because they provided information in addition to that of the optimally associated OCA2-A, B and C groupings. We term these SNP groupings “p-Selected”. Using an R = 1.0 about the point estimate of C for each iris, we observed that the iris color for 96.3% of the samples with matching p-Selected diplotypes fell within the predicted range (P < 0.0001), with a similar rate for Model (96.4%, P < 0.0001) and Validation samples (96.2%, P = 0.0018, shaded region, Table 3). In contrast, the average frequency with which the iris color of randomly selected samples fell within these same ranges was only 62.6% (Table 3) and this difference was highly significant (P < 0.0001). Using an R = 0.9 about each point estimate, we observed that the iris color for 92.7% of the samples with matching p-Selected diplotypes fell within the predicted range, and the rate for Model samples (92.9%) was similar to that for Validation samples (92.3%) (Table 3). 
In contrast, the frequency with which the iris color of randomly selected samples fell within this narrower range was only 56.1% (Table 3) and this difference was highly significant (P < 0.0001). Multiple regression models using iris color score as the dependent variable revealed coefficients of determination (CODs) suggesting that the p-Selected SNP sets combine to explain a significant amount of variation in human iris color scores (OCA2-A COD = 0.1874, OCA2-B COD = 0.1247, OCA2-C COD = 0.0660, OCA2-D COD = 0.2544).
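
The comparison between p-Selected and randomly selected samples at R = 1.0 can be reproduced approximately from Table 3. The text does not say which test produced P < 0.0001 for this comparison, so the sketch below uses a two-sided Fisher's exact test as one reasonable choice. The counts (79 of 82 and 154 of 246) are reconstructed from the rounded percentages and are therefore an assumption, as is the helper name `fisher_exact_two_sided`.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-9))


# Counts reconstructed from Table 3 percentages (assumed, not reported):
# p-Selected: 96.3% of 82 -> 79 concordant; random: 62.6% of 246 -> 154.
p_selected = (79, 82 - 79)
random_sets = (154, 246 - 154)
p_value = fisher_exact_two_sided(*p_selected, *random_sets)
print(f"P = {p_value:.2e}")
```

Under these assumed counts the P value falls well below the 0.0001 threshold quoted in the text.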

Gallery display of iris color inferences

From the preceding analyses, it was apparent that the p-Selected SNP groupings provided the highest value for inferring iris color. Since iris color scores are difficult to visualize, we developed a method for using the point estimates to present the inferred iris color scores and ranges in terms of digital photographs. For each sample with a matching p-Selected OCA2 diplotype, we used ranges about the components of the C point estimates to query the database for samples of similar C value (see Supplementary Material). With a database of 1,072 samples, the galleries typically include 100 or more iris photographs (dependent on the values for the test iris), but in Fig. 3 we have presented six photographs (below the line) representative of the range for each of the 82 “test” samples with matching p-Selected OCA2 diplotypes (iris photograph above the line). Though the summary statistics in Table 3 represent the formal demonstration that the p-Selected OCA2 diplotypes we have described are predictive for iris color, Fig. 3 illustrates that the predictions are also visually satisfying (i.e. “correct”, by most reasonable accounts). In visual terms, the results seem similar for Validation samples (the 26 boxed photo sets, Fig. 3) and Model samples (56 unboxed photo sets, Fig. 3). Those for which the actual color score of the test iris fell outside of a relatively narrow predicted range (using a range of C spanning 0.9 about the point estimate, see Supplementary Material) may be considered “incorrect”, and are shown in the lower right of Fig. 3 (samples 77–82, with boxes or circles around the sample ID). Those for which the actual color score of the test iris fell outside a relatively broad range (using a range of C spanning 1.0 about the point estimate, see Supplementary Material) can be considered more grossly incorrect, and are shown in the lower right with boxes around the sample ID (samples 80–82, Fig. 3). 
Most human observers would probably conclude that many of these latter six predictions were inaccurate, though some may consider samples 78 and 80 correct. Though the use of terms such as “light” or “dark” is subjective, visually speaking, the predicted ranges tend to be fairly restrictive, clearly specifying either a light (e.g. samples 10–17, Fig. 3), intermediate (e.g. samples 4, 19, 20, 49, Fig. 3) or dark color (e.g. samples 3, 7–9, 47, Fig. 3). One exception is sample 74, which shows a wide color range including blues and browns, but all of dark hue (like the test sample). Of the 82 samples with matching p-Selected OCA2 diplotypes in the database, 62 (76%) specified iris color score ranges covering mainly “lighter” colors and hues.
Fig. 3

Predicted and actual iris colors for each sample with a multilocus OCA2 genotype match in the database described in the text (n = 1,072). A sample (n = 6) of irides corresponding to the predicted range of colors and patterns is shown below the horizontal line, corresponding to the actual (“test”) iris immediately above the line. Predictions for 82 irides are shown, and were made based on the iris color score of the matching sample(s) in the database, as described in the text. Boxed sets correspond to test samples not used in the discovery of the SNP associations that comprise the multilocus OCA2 genotypes. Circles around the sample number indicate that the test iris fell within the range of predicted color scores when determination of the range of the prediction used the I = ±23, C = ±0.5 criteria but not the I = ±20, C = ±0.45 criteria. Samples with boxes around the sample number indicate that the test iris did not fall within the range of predicted color scores using either range criteria and all others fell within the predicted range using both criteria. This figure illustrates the general concordance of iris colors among samples with matching OCA2 multilocus genotypes as we have defined them in the text and Tables 2 and 3

Associations independent of population structure

Since iris colors are unevenly distributed among world populations we investigated whether our associations were merely reflections of (i.e. correlations with) crude, continental population stratification (i.e. false positives). Such correlations would result in an ability to correctly infer iris color as we have just discussed, but for the wrong reasons (because they are good markers of ancestry admixture, not because they describe variation in the underlying phenotypically active locus). We qualified the BioGeographical Ancestry Admixture (BGA) of all the Model samples with respect to a set of 176 Ancestry Informative Markers (AIMs) (Frudakis et al. 2003; Frudakis 2005; Shriver and Kittles 2004; Halder et al. 2006), and a continental population model (Shriver et al. 2003, 2005; Rosenberg et al. 2002). Our sample was comprised predominantly of Caucasians, and most samples typed of predominantly “European” admixture (Fig. 4a) though many of the samples exhibited extensive admixture. The sample of individuals with iris color scores above the average was of similar average genomic admixture as the sample below the average (C = 2.07) (Fig. 4a). However, we noted that European admixture was positively correlated with lighter iris colors (R = 0.28, P < 0.0001; Fig. 4a) and sub-Saharan African (R = −0.24, P < 0.0001; Fig. 4b), East Asian (R = −0.1042, P < 0.0059; Fig. 4c) and Native American admixture (R = −0.152, P < 0.0001; Fig. 4d) were correlated with darker iris colors. 
Essentially the same results were obtained when, for each individual, the high or low estimates of admixture bounding the 90% confidence interval were used instead of the most likely estimate (MLE) of admixture (European low estimates: R = 0.2803, European high estimates: R = 0.2816, African low estimates: R = −0.1967, African high estimates: R = −0.2458, East Asian low estimates: R = −0.1169, East Asian high estimates: R = −0.1111; Native American low estimates: R = −0.1327, Native American high estimates: R = −0.1523). These results raised the possibility that the information inherent to our OCA2 SNPs, hap-tags, contiguous OCA2 diplotypes and p-Selected OCA2 diplotypes was due to ancestry information content; that is, that the OCA2 markers and marker sets were informative for iris color because they were markers for an element of ancestry correlated with iris colors (i.e. they were merely artifacts of population structure). To test this, we performed multiple regression analysis for each of the OCA2 SNPs. Genotypes for each of these SNPs were incorporated one at a time into a multiple regression model, along with non-European (sub-Saharan African, East Asian and Indigenous American) admixture, as independent variables, with iris color score as the dependent variable, to assess the independence of each SNP association from ancestry correlation. Alleles for most of the OCA2 SNPs (29/33) were associated with iris color scores independently from non-European (or by definition, European) admixture (Supplementary Material, Table S3). As with the regression analyses of Fig. 4, none of these results changed significantly when, for each individual, the low or high estimate bounding the 90% confidence interval was used instead of the MLE (the P values for each SNP association were essentially identical to those obtained using the MLE; data not shown). These results indicated that the OCA2 SNP allele associations were not due to population stratification and BGA admixture on a continental scale.
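
The logic of the ancestry-adjusted test above can be illustrated with a small ordinary least-squares fit in which a SNP genotype and an admixture covariate enter the same model. This is a sketch under stated assumptions: the data are invented and noise-free, a single pooled non-European admixture covariate stands in for the three components used in the text, the helper `ols` is hypothetical, and significance testing is omitted.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ X via the normal equations.

    X: list of rows, each already including a leading 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented toy data: minor-allele count and pooled non-European admixture.
genotype = [0, 1, 2, 0, 1, 2]
non_european = [0.0, 0.1, 0.05, 0.3, 0.2, 0.4]
# Scores generated with a genotype effect that is independent of admixture.
color_score = [1.0 + 0.5 * g - 2.0 * a for g, a in zip(genotype, non_european)]

X = [[1.0, g, a] for g, a in zip(genotype, non_european)]
intercept, beta_snp, beta_admix = ols(X, color_score)
print(intercept, beta_snp, beta_admix)
```

A SNP whose coefficient stays clearly non-zero once admixture is in the model carries iris color information beyond what ancestry alone explains, which is the sense in which 29 of the 33 associations are described as independent.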
Fig. 4

Regression of iris color scores on individual genomic ancestry admixture (IGAA) estimates for 695 Caucasians who were part of our Discovery set of samples. Iris color scores are correlated with European (a) and non-European admixture: (b) sub-Saharan (Western) African, (c) East Asian and (d) Indigenous (Native) American. Levels greater than about 30% non-European admixture may be suitable as stand-alone classifiers for human iris color, but lower levels are not.

Discussion

Previous work has implicated OCA2 as the primary locus underlying variable iris pigmentation. Though a number of previous OCA2 SNP associations have been identified (Frudakis et al. 2003; Duffy et al. 2007), and though most of the variation in iris colors is accounted for by OCA2 polymorphism (Zhu et al. 2004; Duffy et al. 2007), our understanding of the genetic basis of iris color has remained incomplete, as evidenced by our inability to accurately predict even crude aspects of iris color from OCA2 SNP genotypes. This inability is a function of a few basic problems: (1) few of the polymorphisms are completely dominant or seem to exert their influence in a context-independent manner, (2) iris color is difficult to phenotype objectively, (3) the OCA2 locus has not been comprehensively screened for iris color information and the most important SNPs may not yet be identified, and (4) the genetic rules for how OCA2 gene variants influence iris color are not yet understood. In addition, until now it had not been proven that OCA2 alleles are truly associated with iris color as opposed to correlated with iris colors through their ancestry information. Herein, we attempted to address each of these problems simultaneously. We developed a more objective phenotyping system for iris color based on iris color scores, an in-silico method for simplifying an extraordinarily complex phenotype and specifically estimating the melanin content of the iris (irrespective of how it was patterned). Using these iris color scores, we performed the most comprehensive OCA2-iris color SNP screen so far described (271 SNPs), confirming some previously described OCA2 associations (Frudakis et al. 2003; Duffy et al. 2007) and identifying several new ones. 
We qualified our analysis with respect to cryptic continental-level population structure and lastly, we developed a diplotype-based classification system for the inference of iris color that does not require a prior understanding of the phenotype inheritance rules. From this multi-faceted approach we made a number of interesting observations. P values for the SNP associations were generally stronger with iris color as a quantitative variable compared to those obtained through contingency analysis with binned iris colors (based on these same quantitative scores). Indeed, some of the SNPs we identified (e.g. OCA2-D-8, OCA2-D-9) were only associated with iris color as a quantitative variable and these results suggest that the parameterization of iris color enhanced the power of our study. The associations we identified were shown to be independent from those previously described, and independent of the correlation between continental BioGeographical Ancestry admixture and iris color. Iris colors are unevenly distributed among continental populations, and given the correlation between admixture and iris color we observed, it was incumbent on us to prove that our associations were independent of such admixture. Having done so, our analysis represents the first formal demonstration that the OCA2 locus is associated with human iris colors per se. The accuracy of our classification system was shown to depend on the marker set used: the hap-tag subset performed worst, contiguously arranged SNPs next best, and empirically identified SNP sets best. This order of performance mirrored the ranking in coefficient of determination (R²) among these sets, suggesting that our classification results were specifically a function of iris color information content. 
Overall, our results confirm and extend observations from those before us that OCA2 is a primary iris color gene and suggest that certain OCA2 polymorphisms are necessary and sufficient as predictive markers for the crude aspects of iris color (namely, overall melanin content). We were able to pinpoint the iris color of test samples to within a range of values covering about half of the observed population range. This is essentially equivalent to being able to infer the shade, or lightness of the iris from DNA (but not necessarily the specific pattern or precise color). This is only a first step, and further refinements will clearly require the identification of additional OCA2 polymorphisms, polymorphisms in other genes and/or environmental factors that influence the expression and patterning of specific iris colors.

It has not yet been demonstrated that the polymorphisms we have described are functionally relevant, that is, that they represent phenotypically active variants. Indeed the same can be said of the OCA2 gene association itself. However, our results here combine with a preponderance of evidence from prior work to suggest that OCA2 does indeed represent the major human iris color gene. First, as shown herein, the associations are bona fide associations with the iris color phenotype, not correlated population structure, and sufficient to predict iris color shade with excellent accuracy. Second, we know that mutations in OCA2 cause human albinism, including some forms that influence iris, but not hair or skin pigmentation (Boissy et al. 1996; Brilliant 2001; Oetting et al. 1998; Rinchik et al. 1993). This latter observation is relevant because natural iris colors are inherited independently from hair and skin pigmentation phenotypes. Third, previous linkage and association studies have implicated OCA2 and the region of chromosome 15 containing OCA2 as explaining most of the variation in naturally occurring iris colors (Eiberg and Mohr 1996; Zhu et al. 2004; Posthuma et al. 2006). Fourth, the OCA2 locus is by far the most polymorphic of the known human pigmentation genes, which seems to fit with the quasi-Mendelian inheritance of iris color. By contrast, the implication of a relatively simple gene (short, fewer polymorphisms) as the primary determinant for iris color would not have been as consistent with the quasi-Mendelian nature of iris color inheritance.

Database matching for classification

It is relatively uncommon for human SNP associations to be so fundamentally and profoundly tied to an element of phenotype expression that the phenotype is adequately predicted from SNP genotypes alone. This may be due in part to the fact that most association studies are not aimed at enabling phenotype prediction, and so rarely proceed to such an in-depth dissection of the phenotype–genotype relationship as we have executed here. The methods we used to predict the gross aspect of iris color from OCA2 diplotypes may be of interest to geneticists focused on a variety of problems and systems because the relationship between specific genotypes and complex and/or quantitative phenotypes has implications for better understanding not only the genetic architecture of the phenotype, but its evolutionary history. Prior to the human genome era, the prediction of complex phenotypes from genetic measurements had not been possible. The introduction of DNA microarrays broke this barrier and enabled the detection of highly characteristic and predictive RNA expression signatures for various phenotypes. In ovarian cancer alone, predictive gene expression signatures have been identified as features of tumor metastasis (e.g. Ramaswamy et al. 2003), malignancy (e.g. Ouellet et al. 2005), tumor drug resistance (e.g. Helleman et al. 2006) and cancer prognosis (Bild et al. 2006; Meinhold-Heerlein et al. 2005). However, genotype (i.e. DNA) “signatures” of multifactorial human traits have been far more elusive—probably because DNA is less directly related to phenotype than RNA. With our problem, we knew that the inheritance of iris color is relatively complex, and though we did not understand a priori the context dependence of the associations we had identified, we could clearly discern that predicting iris color from single or even small sets of highly associated OCA2 SNPs was not possible. 
Unfortunately, the number of variables produced from phasing all 33 SNP genotypes (1,018 multilocus genotypes) at our disposal was about the same as the number of samples (1,072), and on the surface, it would seem that any attempts at classification using such a large set would be challenging. However, this type of challenge has not prevented the successful classification of disease subtypes, drug response and prognosis using RNA expression signatures. We reasoned that if the OCA2 diplotypes we had identified unambiguously specified the most grossly observable aspect of iris color (eumelanin content), either directly through phenotypic activity or indirectly through linkage with the phenotypically active OCA2 variants, then the empirical database-matching method we described should perform well regardless of the size of the database. That is, the only penalty for a small database is that fewer “predictions” or “tests” of the hypothesis could be executed. The specificity of our results with iris color suggests that our reasoning was correct, and indicates that for some phenotypes, the use of databased diplotypes can constitute a stand-alone system for the inference of complex (multifactorial, quantitative) phenotypes. The database matching method we have described is expected to be robust to various (and still unknown) components of iris color genetic complexity, such as those one might expect to encounter within large and/or highly polymorphic genes. For example, it is possible that haplotypes with a region marginally associated with “blue” irides and a region marginally associated with “brown” irides together beget “green” irides in an additive sense, or beget “brown” irides in an epistatic sense. 
Neither the epistatic nor dominance effects of a locus need be universal in all haplotype contexts, and predicting the myriad of potentially complex classification “rules” using summary statistics or hap-tag SNPs seems challenging for such a complex gene, if not impossible. Yet each is theoretically accommodated by a database-matching system even if they are not yet defined, and no matter how small the database (or large the number of variables). Of course, the method is the most basic imaginable, and while it enjoys an empirical power not usually associated with methods that rely on generalizations based on population averages, it suffers from the requirement of very large sample sizes if one hopes to be able to classify most new samples encountered (particularly for large genes and/or systems involving many genes). For example, as a result of the current size of our database (n = 1,072), we could only “classify” about 8% of the iris colors in our sample (82 with p-Selected OCA2 diplotype matches out of 1,072 total samples). If we assume that our results extend to other samples, as our validation sample results suggest they do, then improving on this rate is simply a matter of building the database. In addition to increasing the number of shared OCA2 diplotypes, adding samples is expected to increase the number of observed haplotypes as well, but we expect a finite number of haplotypes and diplotypes to exist in nature and at some point the complexity of the database is expected to plateau. Indeed, an earlier instance of this database compiled with n = 835 samples revealed 64 OCA2-A, 37 OCA2-B and 18 OCA2-C haplotypes which is a similar level of haplotype complexity observed in the current sample of 1,072 (70 OCA2-A, 38 OCA2-B and 19 OCA2-C haplotypes), indicating that with 1,072 samples we are already well along the plateau phase. 
Thus, rather than merely increase the complexity of the classification system, the addition of samples to our database from this point in its development onward is expected to increase the number and population of shared OCA2 diplotypes, enabling us to classify a larger fraction of new samples. Notwithstanding, there are some who believe that any calculation of classification accuracy must involve all samples, not just those one feels comfortable or able to classify. Under this criterion, the classification of all but the simplest of phenotypes from DNA is likely to wait many years until the cost per genotype comes down and research groups can afford to build databases tens of thousands of samples strong. However, we have ample evidence that focusing technology on subsamples is not only acceptable in a theoretical sense but highly effective in a practical sense as well. For example, the drug Herceptin® is used to treat only a small fraction of breast cancer patients (those with Her2-positive cancers) because it performs well in that subpopulation. Had Herceptin® efficacy calculations been mandated in the clinical trial to include all breast cancer patients, the drug would likely not have been approved and papers on its effectiveness summarily rejected.
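
The coverage and plateau arguments above rest on simple arithmetic, which the following sketch makes explicit using only the counts quoted in the text. The variable names are illustrative, and the "new haplotypes per 100 samples" rate is a convenience measure introduced here, not one used by the authors.

```python
# Fraction of the database that could be "classified" (82 of 1,072 samples
# shared a p-Selected OCA2 diplotype with at least one other sample).
matched, database_size = 82, 1072
coverage = matched / database_size

# Haplotype counts reported for the earlier (n = 835) and current (n = 1,072)
# instances of the database.
earlier = {"n": 835, "OCA2-A": 64, "OCA2-B": 37, "OCA2-C": 18}
current = {"n": 1072, "OCA2-A": 70, "OCA2-B": 38, "OCA2-C": 19}

added_samples = current["n"] - earlier["n"]
new_haplotypes = {}
for block in ("OCA2-A", "OCA2-B", "OCA2-C"):
    new_haplotypes[block] = current[block] - earlier[block]
    # Average discovery rate up to n = 835 versus the rate over the added samples.
    early_rate = 100 * earlier[block] / earlier["n"]
    late_rate = 100 * new_haplotypes[block] / added_samples
    print(f"{block}: +{new_haplotypes[block]} haplotypes from {added_samples} new samples "
          f"({early_rate:.1f} per 100 samples earlier, {late_rate:.1f} per 100 later)")

print(f"coverage: {coverage:.1%}")
```

The 237 added samples contributed only a handful of new haplotypes per block, which is the basis for the statement that haplotype complexity is approaching a plateau while the pool of shared diplotypes can still grow.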

OCA2 phenotype complexity and demographic history

The difficulty in predicting the iris color of offspring from that of the parents is the primary evidence that the inheritance of iris color is complex. The continuous distribution of melanin content we observed is additional evidence of the complexity of the phenotype and belies the commonly held notion that iris colors are discrete. The bimodal nature of the distribution could be due to dominance but it may also be due to positive assortative mating. In support of this idea, alleles for about one-third of the SNPs we described (10/33) were found not to be in Hardy–Weinberg equilibrium (HWE), and our results suggested this was unlikely to be due to genotyping error or ambiguity. Assortative mating is a primary reason that alleles for a polymorphism may not be in HWE (Cavalli-Sforza and Bodmer 1999), and it is not hard to imagine this force at work shaping the allelic distribution of a gene involved in iris pigmentation. If so, we might expect to find that iris colors are unequally apportioned among various European sub-populations. Indeed, studies using Eurasian Ancestry Informative Markers have demonstrated an association between sub-European population structure and iris color along a northwestern to southeastern axis (Frudakis 2007). That is, lighter irides are more commonly found in Northwestern and Continental Europe whereas darker irides are more commonly found among individuals of genetic ancestry common in the Middle East and South Asia (regardless of the nations from which their recent ancestors derived, Frudakis 2007). There is evidence for positive selection at the OCA2 locus within Europeans and Asians but not Africans (Voight et al. 2006; Lao et al. 2007) and lighter pigmentation phenotypes are generally recognized to represent a derived state from a more pigmented ancestral state (Lao et al. 2007). Thus, we might expect the overrepresented genotype for each of our 11 non-HWE SNPs to be associated with blue, rather than brown iris colors. 
However, we were surprised to note that for 64% of non-HWE SNPs (7/11), it was the genotype associated with darker iris colors that was overrepresented. This could be the result of differential OCA2 exchange rates; the integration of individuals with lighter irides into populations with dark irides in Southeastern Europe, the Middle East and South Asia may have been less historically common than the reverse. Alternatively, or perhaps in addition, the disequilibrium may be the result of higher fertility rates among European and Eurasian populations with darker iris colors, reminiscent of the fertility disparity among native and immigrant populations within continental Europe today.
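
A departure from Hardy–Weinberg proportions of the kind discussed above is commonly checked per SNP with a chi-square goodness-of-fit test. This section does not state which test the authors used, so the following is only a generic sketch; the function name `hwe_chi_square` and the genotype counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions.

    Takes observed genotype counts for a biallelic SNP and returns
    (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2.0))


# Hypothetical counts: one SNP in equilibrium, one with a homozygote excess
# (a heterozygote deficit is the pattern expected under assortative mating).
chi2_eq, p_eq = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(40, 20, 40)
print(chi2_eq, p_eq)
print(chi2_dev, p_dev)
```

The test only flags a departure; deciding which genotype is overrepresented, as in the 7/11 comparison above, requires comparing each observed count with its expected value.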

It is interesting that the most informative diplotypes comprised SNP sets that had to be empirically determined (the p-Selected sets) and were not necessarily related to one another by distance or LD within the locus. It is also interesting that diplotypes comprised of the hap-tag SNPs were not as predictive as diplotypes covering all 33 of our associated SNPs, whether ordered along the chromosome (such as our contiguous sets) or not (such as the p-Selected sets). The fact that most of the associations we described were independent from those of the hap-tagged SNPs, that large SNP sets were required to “explain” iris color in a classification sense, and the large number of SNPs at the OCA2 locus in general seem to suggest that over the past 50,000 years, as populations expanded out of Africa and the Middle East into Europe and Eurasia, a high degree of functional allelic complexity has been maintained, perhaps as a result of dense population structure amalgamation within Eurasia. The large number of phenotypically relevant and/or associated OCA2 haplotypes seems to confirm that attempting to reduce even the crude elements of a complex phenotype (e.g. apparently multifactorial and/or of complex inheritance rules) to hap-tagged SNPs is not always the best approach if the goal is to predict phenotype from DNA sequences, since doing so may ignore a significant portion of locus diversity. Indeed, while this manuscript was in review, Duffy et al. (2007) described a screen of 58 exonic and tagging OCA2 SNPs, finding three that explained most of the variation in visually assessed iris colors, though these authors did not attempt to classify unknown irides based on the corresponding genotypes. Two of these three SNPs are among the set we describe here (OCA2-D-10 and OCA2-D-11), but while Duffy et al. scanned the entire OCA2 locus they probably did not identify most of those we describe here because they focused on a smaller number of tagging and exonic SNPs (n = 58). 
In contrast, our work was a shotgun-style association scan and imposed no assumptions about the relationship between gene location and functional relevance (indeed, most of the SNPs we describe are located within introns). However, while there may be evidence that the history of iris color evolution is rich with detail and complexity, there is also evidence that the complexity of lighter colored irides pales in comparison to that for darker colored irides. For example, of the 82 p-Selected OCA2 diplotypes shared by more than one sample (i.e. the most common diplotypes), 62 (76%) specified a range of color scores that fell predominantly in the lighter end of the spectrum. This is a by-product of greater diversity of darker multilocus genotypes compared to lighter genotypes in our database. That is, there are fewer “blue” OCA2 sequences, and individuals with blue irides therefore tend to have the same genotypes more often than individuals with brown irides. This observation fits with our expectations, based on our current understanding of the recent origins of the world’s populations and the observation that lighter iris color haplotypes represent fairly recent, derived sequences of older, darker haplotypes. Further work on the phylogeography of the haplotypes we have described could therefore help illuminate aspects of human expansions and migrations out of Africa that have heretofore been occluded by admixture and reverse gene flow.

Notes

Acknowledgments

We thank Shannon Boyd and Sara Barrow for assistance with genotyping and Mark Shriver and Marc Bauchet of the Pennsylvania State University for assistance in collecting samples. Our work was supported with private funds.

References

  1. Abbott C, Jackson I, Carritt B, Povey S (1991) The human homolog of the mouse brown gene maps to the short arm of chromosome 9 and extends the known region of homology with mouse chromosome 4. Genomics 11:471–473
  2. Akey J, Wang H, Xiong M, Wu H, Liu W, Shriver M, Jin L (2001) Interaction between the melanocortin-1 receptor and P genes contributes to inter-individual variation in skin pigmentation phenotypes in a Tibetan population. Hum Genet 108:516–520
  3. Barrett J, Fry B, Maller J, Daly M (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21(2):263–265
  4. Bernstein F (1931) Die Geographische verteilung der blutgruppen und ihre anthropologische bedeutung. in Comitato Italiano per lo Studio dei Problemi Della Populazione. Roma Instituto Poligrafico dello stato. pp 227–243
  5. Bild A, Yao G, Chang J, Wang Q, Potti A, Chasse D, Joshi M, Harpole D, Lancaster J, Berchuck A, Olson J, Marks J, Dressman H, West M, Nevins J (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439:353–357
  6. Bito L, Matheny A, Cruickshanks K, Nondahl D, Carino O (1997) Eye color changes past early childhood. The Louisville Twin Study. Arch Ophthalmol 115:659–663
  7. Boissy R, Zhao H, Og W, Austin L, Wildenberg S, Boissy Y, Zhao Y, Sturm R, Hearing V, King R, Nordlund J (1996) Mutation in and lack of expression of tyrosinase-related protein-1 (TRP-1) in melanocytes from an individual with brown oculocutaneous albinism: a new subtype of albinism classified as ‘OCA3.’ Am J Hum Genet 58:1145–1156
  8. Bonilla C, Parra E, Pfaff C, Dios S, Marshall J, Hamman R, Ferrell R, Hoggart C, McKeigue P, Shriver M (2004) Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann Hum Genet 68:139–153
  9. Box N, Wyeth J, O’Gorman L, Martin N, Sturm R (1997) Characterization of melanocyte stimulating hormone variant alleles in twins with red hair. Hum Mol Genet 6:1891–1897
  10. Box N, Duffy D, Irving R, Russell A, Chen W, Griffyths L, Parsons P, Green A, Sturm R (2001) Melanocortin-1 receptor genotype is a risk factor for basal and squamous cell carcinoma. J Invest Dermatol 116:224–229
  11. Brauer G, Chopra V (1978) Estimation of the heritability of hair and eye color. Anthropol Anz 36:109–120
  12. Brilliant M (2001) The mouse p (pink-eyed dilution) and human P genes, oculocutaneous albinism type 2 (OCA2), and melanosomal pH. Pigment Cell Res 14:86–93
  13. Brues A (1975) Rethinking human pigmentation. Am J Phys Anthropol 43:387–391
  14. Cavalli-Sforza L, Bodmer W (1999) The genetics of human populations. Dover, Mineola pp 45–59
  15. Chakraborty R (1986) Gene admixture in human populations: models and predictions. Yearbook Phys Anthropol 29:1–43
  16. Chintamaneni C, Ramsay M, Colman M, Fox M, Pickard R, Kwon B (1991) Mapping the human CAS2 gene, the homologue of the mouse brown (b) locus, to human chromosome 9p22-pter. Biochem Biophys Res Commun 178:227–235
  17. Choudhry S, Coyle N, Tang H, Salari K, Lind D, Clark S, Tsai H, Naqvi M, Phong A, Ung N, Matallana H, Avila P, Casal J, Torres A, Nazario S, Castro R, Battle N, Perez-Stable E, Kwok P, Sheppard D, Shriver M, Rodriguez-Cintron W, Risch N, Ziv E, Burchard E (2006) Population stratification confounds genetic association studies among Latinos. Hum Genet 118:652–664
  18. Devilly G (2005) The odds ratio generator for windows: version 1.0 (computer programme). Centre for Neuropsychology, Swinburne University, Australia
  19. Duffy D, Montgomery G, Chen W, Zhao Z, Le L, James M, Hayward N, Martin N, Sturm R (2007) A three-single-nucleotide polymorphism haplotype in intron 1 of OCA2 explains most human eye-color variation. Am J Hum Genet 80(2):241–252
  20. Durham-Pierre D, Gardner J, Nakatsu Y, King R, Francke U, Ching A, Aquaron R, del Marmol V, Brilliant M (1994) African origin of an intragenic deletion of the human P gene in tyrosinase positive oculocutaneous albinism. Nat Genet 7:176–179
  21. Durham-Pierre D, King R, Naber J, Laken S, Brilliant M (1996) Estimation of carrier frequency of a 2.7 kb deletion allele of the P gene associated with OCA2 in African-Americans. Hum Mutat 7:370–373
  22. Eiberg H, Mohr J (1996) Assignment of genes coding for brown eye colour (BEY2) and brown hair colour (HCL3) on chromosome 15q. Eur J Hum Genet 4:237–241
  23. Fernandez J, Shriver M, Beasley M, Rafla-Demetrious N, Parra E, Albu J, Nicklas B, Ryan A, McKeigue P, Hoggart C, Weinsier R, Allison D (2003) Association of African genetic admixture with resting metabolic rate and obesity among women. Obes Res 11:904–911
  24. Flanagan N, Healy E, Ray A, Philips S, Todd C, Jackson I, Birch-Machin M, Rees J (2000) Pleiotropic effects of the melanocortin 1 receptor (MC1R) gene on human pigmentation. Hum Mol Genet 9:2531–2537
  25. Frudakis T, Thomas M, Gaskin Z, Venkateswarlu K, Chandra S, Ginjupalli S, Gunturi S, Natrajan S, Ponnuswamy V, Ponnuswamy K (2003) Sequences associated with human iris pigmentation. Genetics 165:2071–2083
  26. Frudakis T (2005) Powerful but requiring caution: genetic tests of ancestral origins. Nat General Soc Quart 93:260–268
  27. Frudakis T (2007) Molecular Photofitting: the inference of phenotype from DNA. Elsevier/Academic, New York (in press)
  28. Gardner J, Nakatsu Y, Gondo Y, Lee S, Lyon M, King R, Brilliant M (1992) The mouse pink-eyed dilution gene: association with human Prader-Willi and Angelman syndromes. Science 257:1121–1124
  29. Halder I, Shriver M, Thomas M, Fernandez J, Frudakis T (2006) A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Genet (In review)
  30. Hamabe J, Fukushima Y, Harada N, Abe K, Matsuo N, Nagai T, Yoshioka A, Tonoki H, Tsukino R, Niikawa N (1991) Molecular study of the Prader-Willi syndrome: deletion, RFLP, and phenotype analyses of 50 patients. Am J Med Genet 41:54–63
  31. Helleman J, Jansen M, Span P, van Staveren I, Massuger L, Meijer-van Gelder M, Sweep F, Ewing P, van der Burg M, Stoter G, Nooter K, Berns E (2006) Molecular profiling of platinum resistant ovarian cancer. Int J Cancer 118:1963–1971
  32. Hoggart C, Parra E, Shriver M, Bonilla C, Kittles R, Clayton D, McKeigue P (2003) Control of confounding of genetic associations in stratified populations. Am J Hum Genet 72:1492–1504
  33. Imesch P, Wallow I, Albert D (1997) The color of the human eye: a review of morphologic correlates and of some conditions that affect iridial pigmentation. Surv Ophthalmol 41(Suppl 2):S117–S123
  34. Lao O, de Gruijter J, van Duijn K, Navarro A, Kayser M (2007) Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. Ann Hum Genet 71:354–369
  35. McKeigue P (1997) Mapping genes underlying ethnic differences in disease risk by linkage disequilibrium in recently admixed populations. Am J Hum Genet 60:188–196
  36. McKeigue P, Carpenter J, Parra E, Shriver M (2000) Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to the African-American populations. Ann Hum Genet 64(pt2):171–186
  37. Meinhold-Heerlein I, Bauerschlag D, Hilpert F, Dimitrov P, Sapinoso L, Orlowska-Volk M, Bauknecht T, Park T, Jonat W, Jacobsen A, Sehouli J, Luttges J, Drajewski M, Krajewski S, Reed J, Arnold N, Hampton G (2005) Molecular and prognostic distinction between serous ovarian carcinomas of varying grade and malignant potential. Oncogene 24:1053–1065
  38. Molokhia M, Hoggart C, Patrick A, Shriver M, Parra E, Ye J, Silman A, McKeigue P (2003) Relation of risk of systemic lupus erythematosus to west African admixture in a Caribbean population. Hum Genet 112:301–308
  39. Oetting W, Gardner J, Fryer J, Ching A, Durham-Pierre D, King R, Brilliant M (1998) Mutations of the human P gene associated with Type II oculocutaneous albinism (OCA2). Hum Mutat 12:434
  40. Ooi C, Moreira J, Dell’Angelica E, Poy G, Wassarman D, Bonifacino J (1997) Altered expression of a novel adaptin leads to defective pigment granule biogenesis in the Drosophila eye color mutant garnet. EMBO J 16:4508–4518
  41. Ouellet V, Provencher D, Maugard C, Le Page C, Ren F, Lussier C, Novak J, Ge B, Hudson T, Tonin P, Mes-Masson A (2005) Discrimination between serous low malignant potential and invasive epithelial ovarian tumors using molecular profiling. Oncogene 24:4672–4687
  42. Parra E, Kittles R, Shriver M (2004) Implications of correlations between skin color and genetic ancestry for biomedical research. Nat Genet 36:S54–S60
  43. Parra E, Marcini A, Akey J, Martinson J, Batzer M, Cooper R, Forrester T, Allison D, Deka R, Ferrell R, Shriver M (1998) Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet 63:1839–1851
  44. Pfaff C, Parra E, Bonilla C, Hiester K, McKeigue P, Kamboh M, Hutchinson R, Ferrell R, Boerwinkle E, Shriver M (2001) Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. Am J Hum Genet 68:198–207
  45. Posthuma D, Visscher P, Willemsen G, Zhu G, Martin N, Slagboom P, de Geus E, Boomsma D (2006) Replicated linkage for eye color on 15q using comparative ratings of sibling pairs. Behav Genet 36:12–17
  46. Prota G, Hu D, Vincensi M, McCormick S, Napolitano A (1998) Characterization of melanins in human irides and cultured uveal melanocytes from eyes of different colors. Exp Eye Res 67:293–299
  47. Purcell S, Cherny S, Sham P (2003) Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19:149–150
  48. Puri N, Gardner J, Brilliant M (2000) Aberrant pH of melanosomes in pink-eyed dilution (p) mutant melanocytes. J Invest Dermatol 115:607–613
  49. Ramaswamy S, Ross K, Lander E, Golub T (2003) A molecular signature of metastasis in primary solid tumors. Nat Genet 33:49–54
  50. Reiner A, Ziv E, Lind D, Nievergelt C, Schork N, Cummings S, Phong A, Burchard E, Harris T, Psaty B, Kwok P (2005) Population structure, admixture, and aging-related phenotypes in African American adults: the cardiovascular health study. Am J Hum Genet 76:463–477
  51. Rinchik E, Bultman S, Horsthemke B, Lee S, Strunk K, Spritz R, Avidano K, Jong M, Nicholls R (1993) A gene for the mouse pink-eyed dilution locus and for human type II oculocutaneous albinism. Nature 361:72–76
  52. Robbins L, Nadeau J, Johnson K, Kelly M, Roselli-Rehfuss L, Baack E, Mountjoy K, Cone R (1993) Pigmentation phenotypes of variant extension locus alleles result from point mutations that alter MSH receptor function. Cell 72:827–834
  53. Rosenberg N, Mahajan S, Ramachandran S, Zhao C, Pritchard J, Feldman M (2005) Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet 1:e70
  54. Rosenberg N, Pritchard J, Weber J, Cann H, Kidd K, Zhivotovsky L, Feldman M (2002) Genetic structure of human populations. Science 298:2381–2385
  55. Schneider S, Roessli D, Excoffier L (2000) Arlequin ver. 2.000: a software for population genetics data analysis. Genetics and Biometry Laboratory, University of Geneva, Switzerland
  56. Shriver M, Parra E, Dios S, Bonilla C, Norton H, Jovel C, Pfaff C, Jones C, Massac A, Cameron N, Baron A, Jackson T, Argyropoulos G, Jin L, Hoggart C, McKeigue P, Kittles R (2003) Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet 112:388–399
  57. Shriver M, Mei R, Parra E, Sonpar V, Halder I, Tishkoff S, Schurr T, Zhadanov S, Osipova L, Brutsaert T, Friedlaender J, Jorde L, Watkins W, Bamshad M, Gutierrez G, Loi H, Matsuzaki H, Kittles R, Argyropoulos G, Fernandez J, Akey J, Jones K (2005) Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genom 2:81–89
  58. Shriver M, Kittles R (2004) Genetic ancestry and the search for personalized genetic histories. Nat Rev Genet 5:611–618
  59. Smith R, Healy E, Siddiqui S, Flanagan N, Steijlen P, Rosdahl I, Jacques J, Rogers S, Turner R, Jackson I, Birch-Machin M, Rees J (1998) Melanocortin 1 receptor variants in an Irish population. J Invest Dermatol 111:119–122
  60. Smith M, Patterson N, Lautenberger J, Truelove A, McDonald G, Waliszewska A, Kessing B, Malasky M, Scafe C, Le E, De Jager P, Mignault A, Yi Z, The G, Essex M, Sankale J, Moore J, Poku K, Phair J, Goedert J, Vlahov D, Williams S, Tishkoff S, Winkler C, De La Vega F, Woodage T, Sninsky J, Hafler D, Altshuler D, Gilvert D, O’Brien S, Reich D (2004) A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74:1001–1013
  61. Stephens M, Smith N, Donnelly P (2001) A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68:978–989
  62. Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162–1169
  63. Sturm R, Frudakis T (2004) Eye colour: portals into pigmentation genes and ancestry. Trends Genet 20:327–332
  64. Sturm R, Teasdale R, Box N (2001) Human pigmentation genes: identification, structure and consequences of polymorphic variation. Gene 277:49–62
  65. Terwilliger J, Goring H (2000) Gene mapping in the 20th and 21st centuries: statistical methods, data analysis, and experimental design. Hum Biol 72:63–132
  66. Voight B, Kudaravalli S, Wen X, Pritchard J (2006) A map of recent positive selection in the human genome. PLoS Biol 4:e72
  67. Yang N, Li H, Criswell L, Gregersen P, Alarcon-Riquelme M, Kittles R, Shigata R, Silva G, Patel P, Belmont J, Seldin M (2005) Examination of ancestry and ethnic affiliation using highly informative diallelic DNA markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet 118:382–392
  68. Zhu G, Evans D, Duffy D, Montgomery G, Medland S, Gillespie N, Ewen K, Jewell M, Liew Y, Hayward N, Sturm R, Trent J, Martin N (2004) A genome scan for eye color in 502 twin families: most variation is due to a QTL on chromosome 15q. Twin Res 7:197–210
  69. Zhu X, Luke A, Cooper R, Quertermous T, Hanis C, Mosley T, Gu C, Tang H, Rao D, Risch N, Welder A (2005) Admixture mapping for hypertension loci with genome-scan markers. Nat Genet 37:177–181

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Tony Frudakis (1)
  • Timothy Terravainen (2)
  • Matthew Thomas (1)

  1. DNAPrint Genomics, Inc., Sarasota, USA
  2. Department of Statistics, Columbia University, New York, USA
