Evaluating HapMap SNP data transferability in a large-scale genotyping project involving 175 cancer-associated genes
- Cite this article as:
- Ribas, G., González-Neira, A., Salas, A. et al. Hum Genet (2006) 118: 669. doi:10.1007/s00439-005-0094-9
One of the many potential uses of the HapMap project is its application to the investigation of complex disease aetiology among a wide range of populations. This study aims to assess the transferability of HapMap SNP data to the Spanish population in the context of cancer research. We have carried out a genotyping study in Spanish subjects involving 175 candidate cancer genes using an indirect gene-based approach and compared results with those for HapMap CEU subjects. Allele frequencies were very consistent between the two samples, with a high positive correlation (R) of 0.91 (P ≪ 1×10⁻⁶). Linkage disequilibrium patterns and block structures across each gene were also very similar, with disequilibrium coefficient (r²) highly correlated (R=0.95, P ≪ 1×10⁻⁶). We found that of the 21 genes that contained at least one block larger than 60 kb, nine (ATM, ATR, BRCA1, ERCC6, FANCC, RAD17, RAD50, RAD54B and XRCC4) belonged to the GO category “DNA repair”. Haplotype frequencies per gene were also highly correlated (mean R=0.93), as was haplotype diversity (R=0.91, P ≪ 1×10⁻⁶). “Yin yang” haplotypes were observed for 43% of the genes analysed and 18% of those were identical to the ancestral haplotype (identified in chimpanzee). Finally, the portability of tagSNPs identified in the HapMap CEU data using pairwise r² thresholds of 0.8 and 0.5 was assessed by applying these to the Spanish and current HapMap data for 66 genes. In general, the HapMap tagSNPs performed very well. Our results show generally high concordance with HapMap data in allele frequencies and haplotype distributions and confirm the applicability of HapMap SNP data to the study of complex diseases among the Spanish population.
The International HapMap Project commenced in October 2002 and has created great expectations in the biomedical field across the world (The International HapMap Consortium 2003, 2004). It was initiated with the aim of identifying and cataloguing genetic similarities and differences in human beings and will eventually generate validated and publicly available genotype information for around 2.5 million single nucleotide polymorphisms (SNPs) among four samples from distinct populations of African, Asian and European ancestry. Among the many potential contributions to biomedical research is the application of this information to other populations in general, and to disease populations in particular, thereby assisting in the identification of genes involved in both complex disease and response to therapeutic drugs.
Among Europeans, more than one-third of chromosomes consist of regions of high linkage disequilibrium (LD), or LD blocks (Gabriel et al. 2002; Wall and Pritchard 2003). These regions are characterised by strong association between alleles, low haplotype diversity and low recombination rates (Goldstein 2001). In these high LD regions, it is possible to select SNPs (tagSNPs) that represent or “tag” the common allelic variation and thereby reduce the number of SNPs that need to be genotyped to capture the common variation across the genome.
Several studies have shown that patterns of LD vary between populations and ethnic groups (Abecasis et al. 2001; Zavattari et al. 2000). For this reason, it remains controversial whether the four populations sampled by HapMap provide information that is transferable to other world populations or whether specific haplotype maps are required for each individual population with a particular demographic history.
In this study, we followed an indirect gene-based approach to SNP selection, targeting a large number of genes potentially involved in cancer. Genes were chosen because they had previously been reported to be implicated in cancer development or progression, because they are involved in fundamental cancer-related pathways, or because their specific function suggested a potential role in cancer susceptibility. The indirect approach was adopted because it requires no prior knowledge or hypotheses regarding causative variants. Furthermore, the approach uses information from a reduced set of SNPs that capture most of the common variation in a gene. The results from this study could therefore be used in SNP- and haplotype-based association studies for complex diseases such as cancer.
In order to explore the applicability of HapMap data to a specific southwest European population (Spain), allele frequencies of selected SNPs and frequencies of haplotypes formed within genes were estimated in Spanish subjects and compared to those published by HapMap for subjects with European ancestry (CEPH study based on Utah residents with ancestry from northern and western Europe), hereafter referred to as CEU data (http://www.hapmap.org). Furthermore, an important test of the utility of HapMap data in their application to other populations is whether tagSNPs provided by HapMap capture the same amount of information on ungenotyped SNPs in a population other than those studied in HapMap. We tested the performance of HapMap tagSNPs for each gene by applying the HapMap tagSNP selection algorithm to our selected SNPs using HapMap CEU data, identifying tagSNPs, and then measuring the degree to which these tag SNPs captured the variation observed in our Spanish sample.
Materials and methods
Blood samples were obtained from 845 healthy Spanish volunteers recruited in Madrid, all of whom gave informed consent. Genomic DNA was isolated from peripheral blood lymphocytes using automatic DNA extraction according to the manufacturer’s recommended protocols (Magnapure, Roche).
Candidate gene choice and polymorphism selection
The 175 candidate genes studied were selected applying the following criteria: genes previously reported to be associated with, or known to be involved in cancer; genes codifying BRCA1/2 binding proteins (dimerization proteins, cofactors, components of the BASC complex, genes involved in BRCA1 pathways); several components of DNA repair pathways [nucleotide excision (NER), double strand breaks (DSB) and mismatch]; and genes involved in chromatin remodelling (helicases, HDAC family), cell cycle pathways (INK, CDK and cyclin families), cell communication (growth factor receptors, MAP kinases and NFKB families), hormone metabolism, apoptosis (BCL and CASP families), carcinogen metabolism (CYP, NAT and GST families), cell adhesion, and signal transduction (RAS and FOS). A full list of these genes is available in Supplementary table 1.
Genes were classified using Gene Ontology categories (The Gene Ontology 2000) with the help of the related programs DAVID (Dennis et al. 2003) and EASE (Hosack et al. 2003). EASE was also used to test for differences in the distribution of GO categories between those genes with SNPs that formed blocks larger than 60 kb and the other genes studied.
SNPs were selected according to density (spaced 10 kb on average) (Ke et al. 2004) and minor allele frequency (greater than 5%). Genotyping was carried out using two high-throughput platforms: Illumina Bead Array System (Illumina Inc., San Diego, USA) (Oliphant et al. 2002) and SNPlex (Applied Biosystems, Foster City, USA) (De la Vega et al. 2005), in each case according to the manufacturer's protocol. A total of 899 SNPs were genotyped: 552 SNPs (122 genes) in 845 subjects using the Illumina platform; and 347 SNPs (53 genes) in 409 subjects using SNPlex. Of these, 65% were located in intronic regions, 11% in coding regions, 10% 5′-upstream, 10% 3′-downstream, 3% in the 3′ UTR and 1% in the 5′ UTR (details in Supplementary table 1).
Pearson’s correlation coefficient (referred to as R, to distinguish it from the disequilibrium coefficient, r²) was used to measure correlations in haplotype and allele frequencies, linkage disequilibrium parameters and haplotype diversity between the two samples.
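As a minimal sketch of the correlation measure described above, the following computes Pearson's R for two lists of per-SNP values. The function name `pearson_r` and the example frequencies are illustrative assumptions, not data or code from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient R between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence has zero variance.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical minor allele frequencies for five SNPs (not values from the study).
maf_ceu = [0.10, 0.25, 0.40, 0.15, 0.33]
maf_spain = [0.12, 0.22, 0.43, 0.13, 0.35]
print(round(pearson_r(maf_ceu, maf_spain), 3))
```

The same function applies unchanged to pairs of haplotype frequencies, r² values or haplotype diversities, one value per SNP pair or gene.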
Minor allele frequencies (MAF) were estimated from Spanish subjects and compared to those estimated by HapMap (based on 60 CEU individuals) via Pearson’s χ² test or Fisher’s exact test, where appropriate. This takes account of sample sizes and genotyping success and tests whether the Spanish and HapMap samples come from populations with the same underlying allele frequencies. A one sample t-test was also used to test whether the Spanish subjects sampled had allele frequencies equal to those published by HapMap, treating the latter as definitive. Hardy-Weinberg (HW) equilibrium was tested using the genhwi command in STATA v8.
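The two-sample comparison can be illustrated with a Pearson χ² test on a 2×2 table of minor and major allele counts. This is a sketch under stated assumptions: no continuity correction is applied, the function name `allele_chi2` is hypothetical, and the counts in the example are invented. When expected counts are small, Fisher's exact test would be used instead, as the text notes.

```python
import math


def allele_chi2(minor_a, total_a, minor_b, total_b):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2
    table of minor/major allele counts in two samples.

    Returns (statistic, p_value). Totals are chromosome counts, i.e. twice
    the number of successfully genotyped individuals."""
    table = [[minor_a, total_a - minor_a],
             [minor_b, total_b - minor_b]]
    n = total_a + total_b
    row = [total_a, total_b]
    col = [minor_a + minor_b, n - minor_a - minor_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: 60 individuals (120 chromosomes) versus 845 individuals
# (1690 chromosomes), assuming complete genotyping in both samples.
stat, p_value = allele_chi2(30, 120, 400, 1690)
print(stat, p_value)
```

Because both sample sizes enter the expected counts, a small reference panel such as 120 chromosomes gives wide sampling variation, which is why this test flags far fewer SNPs than a one-sample test that treats the published frequency as fixed.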
A Bonferroni-adjusted nominal P-value threshold of 0.000083 was used to account for multiple testing, based on 602 effective independent marker loci (N*) among the total 899 SNPs studied. N* was calculated by applying the web-based program SNPSpD (Nyholt 2004) to SNPs on individual chromosomes and summing estimates across chromosomes. This approach has been shown to closely approximate results from adjustment for multiple testing using permutation methods (Nyholt 2004). We also used control of the false discovery rate (based on N*) using the procedure of Benjamini and Liu (Benjamini et al. 2001) to control for multiple testing in a less conservative way.
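The threshold quoted above can be checked with a one-line calculation. The family-wise error rate of 0.05 is an assumption here (the text does not state it), though it is consistent with the reported threshold and N* = 602.

```python
ALPHA = 0.05        # assumed family-wise error rate (not stated explicitly in the text)
N_EFFECTIVE = 602   # effective number of independent loci (N*) reported in the text


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold under a Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_EFFECTIVE)
print(f"{threshold:.6f}")  # prints 0.000083
```

Using N* rather than all 899 SNPs gives a less stringent threshold than 0.05/899, reflecting the correlation between SNPs in linkage disequilibrium.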
We studied haplotypes in the 101 genes in which at least four SNPs were genotyped. These genes are listed in Supplementary table 2. Haplotypes observed in our Spanish sample were inferred and disequilibrium coefficients (r²) for adjacent SNPs were calculated using Haploview v3.11 (Barrett et al. 2005). We identified “yin yang” haplotype pairs according to the following criteria: at least five SNPs, all with MAF of at least 10% and the less frequent of the “yin yang” haplotype having frequency greater than 3% (Zhang et al. 2003). Haplotype blocks were defined in both samples according to Gabriel et al. (2002), using the default parameters in Haploview v3.11. Haplotype diversity was computed individually for each gene, and for each sample, as $h = \left(1 - \sum_i x_i^2\right) \frac{n}{n-1}$, where $x_i$ was the frequency of a given haplotype and $n$ was the number of samples genotyped (Nei and Tajima 1981).
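The haplotype diversity formula above translates directly into code. This is a minimal sketch; the function name `haplotype_diversity` and the example frequencies are illustrative, and n is taken as the number of samples genotyped, following the text.

```python
def haplotype_diversity(freqs, n):
    """Haplotype diversity h = (1 - sum(x_i^2)) * n / (n - 1).

    freqs: haplotype frequencies for one gene in one sample (must sum to 1).
    n: number of samples genotyped."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    return (1.0 - sum(x * x for x in freqs)) * n / (n - 1)


# Hypothetical haplotype frequencies for one gene (not values from the study).
print(haplotype_diversity([0.5, 0.3, 0.15, 0.05], 845))
```

The factor n/(n − 1) corrects for sample size, so it matters mainly for small samples such as the HapMap CEU panel and is close to 1 for the larger Spanish sample.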
The Tagger program included in Haploview identifies tagSNPs that optimally capture allelic variation among a set of SNPs above a minimum r² threshold set by the user. This threshold establishes a minimum for the amount of variation in the non-tagSNPs captured if only the tagSNPs were genotyped. TagSNPs are selected according to the algorithm of Carlson et al. (2004). To assess HapMap tagSNP portability, we first identified tagSNPs among the selected SNPs in each gene for r² thresholds of 0.8 and 0.5 using the HapMap dataset, and then evaluated these tagSNP sets in the Spanish sample. For each non-tagSNP, the r² value between it and each tagSNP was calculated and the maximum value recorded. Average values and minimum values over all non-tagSNPs in each candidate gene were computed as measures of portability of the HapMap tagSNP set in our sample. We also carried out the evaluation of these tagSNPs for r²=0.8 by applying them to the most recent release of HapMap data with SNPs spaced every 5 kb.
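The portability measure described above can be sketched as follows. One simplifying assumption should be stressed: the sketch takes phased haplotypes coded as 0/1 alleles, whereas Haploview estimates r² from unphased genotype data. The function names and the toy data are hypothetical.

```python
def r_squared(a, b):
    """Pairwise r^2 between two biallelic SNPs, given 0/1 alleles on the
    same set of phased chromosomes (an assumption of this sketch)."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treated as 0 here
    d = p_ab - p_a * p_b
    return d * d / denom


def tag_portability(haps, tags):
    """For each non-tag SNP, record the maximum r^2 with any tag SNP, then
    return (mean, minimum) of those maxima over all non-tag SNPs.

    haps: dict mapping SNP id -> list of 0/1 alleles in the target sample.
    tags: set of SNP ids selected as tagSNPs in the reference sample."""
    best = [max(r_squared(alleles, haps[t]) for t in tags)
            for snp, alleles in haps.items() if snp not in tags]
    if not best:
        raise ValueError("no non-tag SNPs to evaluate")
    return sum(best) / len(best), min(best)


# Toy example with four chromosomes and one tag SNP (hypothetical data).
haps = {"s1": [0, 0, 1, 1], "s2": [0, 0, 1, 1], "s3": [0, 1, 0, 1]}
mean_r2, min_r2 = tag_portability(haps, {"s1"})
print(mean_r2, min_r2)  # prints 0.5 0.0
```

In the toy data, s2 is perfectly tagged by s1 (r² = 1) while s3 is independent of it (r² = 0), so the mean is 0.5 and the minimum is 0. A gene whose mean or minimum falls well below the selection threshold in the target sample indicates poor portability of that tag set.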
Construction of ancestral (Chimpanzee) haplotypes
The BLAT Search Genome platform (http://www.genome.ucsc.edu/cgi-bin/hgBlat) was used to extract FASTA sequence files for high-quality chimpanzee sequences and align them with human reference sequences. Chimpanzee haplotypes were constructed using the chimpanzee nucleotides that aligned to homologous human SNPs. If a chimpanzee nucleotide did not match either human allele, an ambiguous residue was assigned to its position in the chimpanzee haplotype.
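The rule for building the chimpanzee haplotype can be sketched as below. The use of "N" as the ambiguous residue, the function name and the example alleles are assumptions; the text only states that an ambiguous residue was assigned.

```python
def ancestral_haplotype(snp_alleles, chimp_bases, ambiguous="N"):
    """Build a chimpanzee haplotype over human SNP positions.

    snp_alleles: list of (allele1, allele2) tuples, one per human SNP.
    chimp_bases: chimpanzee nucleotides aligned to those SNPs, in order.
    A chimpanzee base matching neither human allele is replaced by the
    ambiguity symbol (assumed here to be "N")."""
    if len(snp_alleles) != len(chimp_bases):
        raise ValueError("one chimpanzee base is required per SNP")
    out = []
    for (a1, a2), chimp in zip(snp_alleles, chimp_bases):
        chimp = chimp.upper()
        out.append(chimp if chimp in (a1.upper(), a2.upper()) else ambiguous)
    return "".join(out)


# Hypothetical SNPs and aligned chimpanzee bases.
print(ancestral_haplotype([("A", "G"), ("C", "T"), ("G", "T")], "ACA"))  # prints ACN
```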
Comparison of allele frequencies
For 317 (35%) of the 899 SNPs considered, one-sample t-tests gave evidence (P<0.000083) that the corresponding allele frequency in the Spanish population differed from that published for the minor allele in HapMap (shown in lighter blue in Fig. 1). However, when the sample sizes used in HapMap and our study were taken into account, only two SNPs had significantly different allele frequencies (P<0.000083, shown in red in Fig. 1). This was the case whether the Bonferroni correction or control of the false discovery rate was used to account for multiple testing. These two SNPs were rs1695144 in the MDM2 gene (HapMap frequency of 0.13 versus 0.01 in Spanish subjects, P=7×10⁻¹¹) and rs4687098 in the TP73L gene (HapMap frequency of 0.16 versus 0.35 in Spanish subjects, P=8×10⁻⁶).
Comparison of linkage disequilibrium patterns
We also compared the haplotype block structures observed in the HapMap (CEU) and Spanish samples. A total of 137 blocks were inferred across the 101 genes using the HapMap data set, compared to 156 for the Spanish sample. The average block size and average marker density within these blocks were very similar in both (32.7 kb and one SNP every 7.5 kb, respectively, for HapMap and 30.3 kb and one SNP every 7.0 kb, respectively, for the Spanish sample). For 67 (67%) genes, at least one block was found to be identical in the two samples. For 47 (70%) of those, this was the only block identified in the gene. The average number of SNPs per block was 3.5 in the Spanish sample and 3.7 in the HapMap CEU sample. However, there were six genes with very large numbers of SNPs per block: BRIP1 and DSS1 had blocks with nine markers each, MSH3 had a block with ten markers, RB1 and RAD54B each had 11 markers in a single block, and FANCC had a block with 16 markers. We found that for the Spanish sample, 21 genes contained at least one haplotype block that was larger than 60 kb. Three of these had much larger blocks: FANCC with 217 kb, BRAF with 192 kb and RB1 with 181 kb. These long block-like structures were also observed in the HapMap CEU data (see Supplementary table 2).
In order to understand more about these 21 genes with large LD blocks, we looked into their Gene Ontology (GO) classifications (Ashburner et al. 2000; The Gene Ontology 2000), and observed that they cover a broad range of biological processes. Nevertheless, the most overrepresented category was “DNA repair” (GO definition: the process of restoring DNA after damage), with nine genes (ATM, ATR, BRCA1, ERCC6, FANCC, RAD17, RAD50, RAD54B and XRCC4) belonging to this category. Indeed, this category appeared to be significantly overrepresented by Fisher’s exact test (P<0.012). While this significance was lost after Bonferroni correction and control of the false discovery rate, these corrections may be overconservative, taking into account that several categories include the same genes, and are, therefore, not independent. In addition, a further two genes, BRIP1 and MSH3 have clear DNA repair functions but were not considered by GO to pertain to this category. We also looked at the recombination rates published for these 21 genes in the deCODE recombination map (Kong et al. 2002) and found that just two were in regions with recombination rates greater than 1 cM/Mb. The average recombination rate was variable among genes, with an extremely low value (<0.1 cM/Mb) at gene BRCA1 (81 kb), intermediate values (0.6 cM/Mb) at genes ATM (108 kb) and PIK3CB (105 kb), and high values (3.2 cM/Mb) at gene BRIP1 (181 kb).
Comparison of haplotypes
[Table/figure not recovered. Caption: Distribution of correlations (R) in haplotype frequencies within genes between HapMap CEU and Spanish data. Recovered label: No. of genes.]
For the eight (8%) genes named and highlighted in red in Fig. 4, no “common” haplotypes were detected. Seven (BCL2, CCND2, BRAF35, CD44, EGFR, ESR1 and PIK3R1) had a very large number of haplotypes (more than 20) with a frequency greater than 1% but less than 5%. The eighth (TP73L) had fewer haplotypes (just six) with a frequency greater than 1% (all below 5%), but the majority (85%) had frequencies below 1%. Six of these eight genes had lower correlations in haplotype frequencies (<0.8), as reported above.
We calculated the proportion of haplotypes that were observed or inferred in both the HapMap (CEU) data and the Spanish sample, considering only haplotypes with a frequency of at least 1%. An average of 75% of haplotypes (across all 101 genes) occurred in both samples, with just 11 genes having less than 50% of haplotypes shared. Of these 11, seven (BCL2, CCND2, CD44, EGFR, ESR1, TP73L and PIK3R1) coincided with those with no “common” haplotypes identified above.
Portability of TagSNPs
We used the Tagger program within Haploview v3.11 to determine tagSNPs among the selected SNPs in each of the remaining 93 genes (the 101 genes analysed for haplotypes, excluding the eight with no “common” haplotypes) using the HapMap CEU data, considering disequilibrium coefficient (r²) thresholds of 0.8 and 0.5 (Barrett et al. 2005). For 27 genes, all SNPs were identified as tagSNPs when an r² threshold of 0.8 was used. For the remaining 66 genes we assessed the portability of the identified tagSNPs by calculating r² among genotyped SNPs in the Spanish sample, assuming that these tags applied, and comparing these to the thresholds. Detailed results are available in Supplementary table 3. Using a threshold of 0.8, eight of the 66 genes yielded an average r² in the Spanish sample below 0.8, and one of these was below 0.7 (RAD17, mean r²=0.49). However, the tagging efficiency for this gene (RAD17) was extremely high for the HapMap data, requiring only one SNP (of six) to be genotyped to capture variation across the entire gene sequence (67 kb). There were three genes for which the minimum value of r² was well below 0.8: RAD17 (0.35), HIF1A (0.56) and XRCC4 (0.60). For the latter two genes, tagging efficiency was also high for the HapMap data, with only two tagSNPs identified out of five for HIF1A (52.7 kb) and seven of 15 for XRCC4 (276.26 kb). When the threshold of 0.5 was used, only one gene had an average r² below this value, but not appreciably so (0.48).
We also evaluated the performance of the tagSNPs determined at the r² threshold of 0.8 by applying them to the most recent release of HapMap data, with SNPs spaced every 5 kb. The aggregate mean r² across the 66 genes was 0.72; however, when we considered only SNPs with minor allele frequencies greater than 10%, the aggregate mean increased to 0.81 (data not shown).
There are relatively few studies that have evaluated the applicability of HapMap SNP data to other populations by studying a large number of genes in a large sample. Those that have were not applied to candidate genes for cancer susceptibility, but rather to genes for coronary heart disease (McCarthy et al. 2004), inflammation, lipid metabolism and blood pressure (Crawford et al. 2004), drug metabolism (Kamatani et al. 2004) and homocysteine metabolism (Janosikova et al. 2005). Other studies have focussed on candidate disease genes but have tended to either cover a limited number of genes or use small samples from populations with diverse ancestry (Bonnen et al. 2002; Liu et al. 2005; Long et al. 2004; Stephens et al. 2001). To date, only two very recent studies have focussed on European populations (Mueller et al. 2005; Sawyer et al. 2005). To our knowledge, our study represents the first attempt at targeting a large number of cancer susceptibility genes in a large sample from a homogeneous population. Our results will, therefore, be important for the incorporation of HapMap SNP data into the optimal design of studies on the genetic determinants of cancer in diverse populations.
Among 899 SNPs in 175 cancer-related genes, allele frequencies were highly correlated between the HapMap (CEU) and Spanish samples. While allele frequencies of 317 SNPs were found to be different from those published in HapMap, only two of those remained so after the HapMap sample size was also taken into account. The difference between these two findings is that, while the former seems to indicate that the published HapMap minor allele frequencies should probably not be taken as definitive (at least for the Spanish population), the latter suggests that, on the whole, the HapMap CEU sample is representative of the Spanish population in terms of allele frequencies and could, therefore, be used in assessing the appropriateness of SNPs for inclusion in association studies.
Linkage disequilibrium (r²) between adjacent SNPs within genes was also highly correlated between the two samples. This result demonstrates that the general pattern of high and low LD regions appears largely consistent in both populations. It is well known that haplotype structure can be broken into a series of discrete blocks of high LD (Daly et al. 2001; Gabriel et al. 2002). It has been reported that approximately 50% of the human genome exists in blocks of around 44 kb among Europeans (Gabriel et al. 2002). Our results are consistent with these findings, with an average LD block length of 30 kb across 101 genes. However, 21 genes had much larger haplotype blocks (larger than 60 kb) and three had blocks larger than 180 kb. These results are consistent with others reported in the literature, with maximum reported block sizes ranging from 300 kb (Abecasis et al. 2001), through 500 kb (Jorde 2000) to as much as 800 kb (Dawson et al. 2002). It has been suggested that some regions of extended LD may play an important role in determining the genetic basis of human phenotypic differences (Hinds et al. 2005).
Large blocks of LD could also be generated by processes affecting variability in different genomic regions, such as local recombination rates. In fact, 19 of these 21 genes with large haplotype blocks are located in regions showing recombination rates lower than average (according to the deCODE recombination map). Significant inter-allelic associations over large genetic distance might result from the action of natural selection (Abecasis et al. 2001; Cannon 1963; Clark 2003; Hudson 1990; Huttley et al. 1999). On the other hand, this high LD might also be due to epistasis (non-additive interactions between sites). Theoretical models predict that certain kinds of epistatic interactions between sites under balancing selection can generate extremely strong linkage disequilibrium over long regions (Hartl and Clark 1997; Kelly and Wade 2000). GO classification of these 21 genes indicated an over-representation of the ‘DNA repair’ category, a finding that would be consistent with balancing selection acting on these regions. Although interesting, these observations need to be further explored and verified in large samples, preferably from different populations or ethnic groups. Whatever the cause of these high LD regions, the cost-benefit of using tagSNPs is clearly advantageous (Thompson et al. 2003).
When the frequencies of haplotypes formed across each of the 101 genes with at least four SNPs genotyped were compared between the HapMap and Spanish samples, correlations were generally very high. Furthermore, “yin yang” haplotypes (where two of the most common haplotypes have complementary alleles at all loci) were identified for 43% of these genes and this was consistent in both samples and in agreement with previous findings (Clark et al. 1998; Daly et al. 2001; Long et al. 2004; Zhang et al. 2003). The presence of yin yang haplotypes seems to be a common phenomenon (Costas et al. 2005; Zhang et al. 2003) which, rather than suggesting the occurrence of natural selection on the genes involved, could be explained by strictly neutral evolution in a well-mixed population. However, while other studies have focussed on genomic regions in general, our study is gene-based and, therefore, offers a more applied context in which to evaluate this hypothesis. We observed that the percentage of ancestral haplotypes (deduced from chimpanzee) that perfectly match with “yin yang” human haplotypes was three times higher than previously reported (18% in the present study compared with 4.8% observed by Zhang et al. 2003). This result leads us to suggest that natural selection might have played a role in promoting this phenomenon, perhaps acting with greater pressure in genes related to disease.
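A minimal sketch of the “yin yang” screen is given below, combining the definition above (complementary alleles at all loci) with the criteria stated in the methods (at least five SNPs, all with MAF of at least 10%, and the less frequent haplotype of the pair above 3%). Haplotypes are assumed to be coded as strings of 0/1 alleles; the function names and example frequencies are hypothetical, and the sketch does not additionally require the pair to rank among the most common haplotypes.

```python
def is_complement(h1, h2):
    """True if two haplotypes differ at every SNP position."""
    return len(h1) == len(h2) and all(a != b for a, b in zip(h1, h2))


def site_mafs(hap_freqs):
    """Minor allele frequency at each SNP, derived from haplotype frequencies."""
    n_sites = len(next(iter(hap_freqs)))
    total = sum(hap_freqs.values())
    mafs = []
    for i in range(n_sites):
        f1 = sum(f for h, f in hap_freqs.items() if h[i] == "1") / total
        mafs.append(min(f1, 1 - f1))
    return mafs


def yin_yang_pairs(hap_freqs, min_snps=5, min_maf=0.10, min_freq=0.03):
    """Return complementary haplotype pairs meeting the stated criteria.

    hap_freqs: dict mapping a 0/1 haplotype string to its frequency."""
    haps = list(hap_freqs)
    if not haps or len(haps[0]) < min_snps:
        return []
    if min(site_mafs(hap_freqs)) < min_maf:
        return []
    pairs = []
    for i, h1 in enumerate(haps):
        for h2 in haps[i + 1:]:
            if is_complement(h1, h2) and min(hap_freqs[h1], hap_freqs[h2]) > min_freq:
                pairs.append((h1, h2))
    return pairs


# Hypothetical haplotype frequencies for a five-SNP gene.
example = {"00000": 0.50, "11111": 0.45, "00100": 0.05}
print(yin_yang_pairs(example))  # prints [('00000', '11111')]
```

Comparing either member of a returned pair with the chimpanzee haplotype over the same SNPs then indicates whether one of the “yin yang” haplotypes matches the ancestral state.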
Only two genes (HIF1A and BRAP) had different degrees of haplotype diversity in the Spanish sample relative to the HapMap (CEU) sample. Selection could have favoured common haplotypes. For instance, evidence for purifying selection has been detected in other cancer genes such as BRCA1 (Hurst and Pal 2001). However, there is no strong reason to believe that a different selective pressure is acting in the populations from which these samples were drawn. It is more probable that genetic drift has reduced the spectrum of haplotypes for these genes in one of the populations studied relative to the other.
Exceptions to the high concordances observed in the various parameters considered were consistently observed in eight genes (BCL2, CCND2, BRAF35, CD44, EGFR, ESR1, PIK3R1 and TP73L) for which the SNPs genotyped formed no “common” haplotypes. These outlier genes tended to have a very large number of haplotypes, a lower proportion of haplotypes shared between the HapMap (CEU) and Spanish samples and, for those haplotypes that were shared, lower correlation in haplotype frequencies between the two samples. For these eight genes we inferred haplotypes within blocks and compared their estimated frequencies between the two samples, and found a much higher correlation in haplotype frequencies, comparable to those found in the other 93 genes. This therefore confirms that these genes have a more complex structure with low average LD (or LD between only a few consecutive markers within genes), which makes haplotype analysis of the whole gene sequence inappropriate. These results also suggest that more dense SNP sets will be required to characterise the LD patterns for these genes. On the whole, while the marker density used in this study is sufficient to represent the common haplotypes for the majority of the genes studied (Ke et al. 2004), this is not the case for a small proportion of genes with lower LD. This highlights the importance of the current HapMap project with double this density (one SNP every 5 kb) in order to adequately capture the LD information required if these latter genes (or others with similar characteristics) are to be appropriately included in association studies.
The general applicability of the HapMap SNP data has to be confirmed using samples from different geographical regions (The International HapMap Consortium 2003). It has been reported that HapMap tagSNPs can be appropriately applied to British, Norwegian, Finnish, Romanian (Nejentsev et al. 2004) and several other European populations (Mueller et al. 2005). TagSNPs for just one gene (RAD17) performed substantially worse in the Spanish sample than predicted using HapMap data (mean r²=0.49 compared to a threshold of 0.80); however, this was based on the identification of just one tagSNP among six. For the remainder of the 66 genes considered, HapMap tagSNPs based on CEU data performed well in our Spanish sample, indicating that they can also be applied to more geographically extreme European populations.
On evaluating the performance of these tagSNPs when applied to the most recent release of HapMap data (based on a density of one SNP every 5 kb), we observed generally high correlations for almost all non-tagSNPs, with the exception of some rarer variants (frequencies less than 10%). This observation would not imply serious consequences under the “common variant, common disease” hypothesis, which postulates that common diseases may be explained by modest effects of a number of relatively common variants (Pritchard and Cox 2002). However, it appears that the use of tagSNPs based on a 10 kb density map could result in reduced power to identify associations involving rarer causal variants.
Even slight differences in SNP and haplotype frequencies could affect the power to detect some phenotype-genotype associations in one population relative to another. Such differences may explain failures to replicate positive associations in different populations. We have found very high correlations between HapMap data and our Spanish sample in both SNP and haplotype frequencies as well as generally very high portability of HapMap tagSNPs to Spanish subjects, indicating that this might be less of an issue for studies confined to European populations. This is consistent with the findings of others (Gonzalez-Neira et al. 2004; Mueller et al. 2005). For those genes presenting clear signs of historical recombination and complex patterns of LD, finer definition of tagSNPs based on denser genotype data will be needed. Additional empirical studies such as the one presented here will be needed in order to objectively identify the appropriate density of SNPs required for other candidate genes in the study of complex diseases.
EB and LPF are funded by the Comunidad Autónoma de Madrid and by the Spanish Ministry of Science and Technology (MCT), respectively. We thank Christian Torrenteras for Illumina platform support and Fatima Mercadillo for her expert technical skills. We would also like to thank Christopher Philips and Beatriz Sobrino for their technical support with the SNPlex genotyping platform as well as Jorge Amigo for his assistance with the Genotyping Data Formatter software used to parse SNPlex data and to control genotyping errors. The National Genotyping Centre (CeGen) is funded by the Genome Spain Foundation. This study was partially supported by BFI2003-03852, PI020919, PI041313 and PGIDIT02PXIC20804PN.