Strategies and issues in the detection of pathway enrichment in genome-wide association studies
- Cite this article as:
- Hong, M., Pawitan, Y., Magnusson, P.K.E. et al. Hum Genet (2009) 126: 289. doi:10.1007/s00439-009-0676-z
A fundamental question in human genetics is the degree to which the polygenic character of complex traits derives from polymorphism in genes with similar or with dissimilar functions. The many genome-wide association studies now being performed offer an opportunity to investigate this, and although early attempts are emerging, new tools and modeling strategies still need to be developed and deployed. Towards this goal, we implemented a new algorithm to facilitate the transition from genetic marker lists (principally those generated by PLINK) to pathway analyses of representational gene sets in either threshold or threshold-free downstream applications (e.g. DAVID, GSEA-P, and Ingenuity Pathway Analysis). This was applied to several large genome-wide association studies covering diverse human traits that included type 2 diabetes, Crohn’s disease, and plasma lipid levels. Validation of this approach was obtained for plasma HDL levels, where functional categories related to lipid metabolism emerged as the most significant in two independent studies. From analyses of these samples, we highlight and address numerous issues related to this strategy, including appropriate gene-based correction statistics, the utility of imputed versus non-imputed marker sets, and the apparent enrichment of pathways due solely to the positional clustering of functionally related genes. The latter in particular emphasizes the importance of studies that directly tie genetic variation to functional characteristics of specific genes. The freely provided software, which we have called ProxyGeneLD, may resolve an important bottleneck in pathway-based analyses of genome-wide association data. This has allowed us to identify at least one replicable case of pathway enrichment but also to highlight functional gene clustering as a potentially serious problem that may lead to spurious pathway findings if not corrected.
The extent to which common genetic polymorphism contributes to variance in complex human traits has been explored for decades, but it has only recently become possible to perform hypothesis-free studies at fine scale on a genome-wide level (Klein et al. 2005). These studies strive to reveal the underlying genetic architecture of complex human diseases and quantitative traits, and although the genetic effect sizes have typically been small, the statistical evidence implicating individual genes has been strong (Barrett et al. 2008; Willer et al. 2008; Zeggini et al. 2008). However, most genome-wide association studies performed to date depict only a few significant loci (http://www.genome.gov/gwastudies/). Among the many remaining questions is whether or not data from these studies can also be used to generate additional insight into the biological pathways that influence disease. If the suggestion that complex trait genetics follows an L-shaped distribution of effect sizes is true (Dixon et al. 2007; Sing and Boerwinkle 1987), then it might be possible to explore the rightmost tail of the distribution for significant genes that share common functions. This represents a natural extension of the approach used in gene expression profiling, where pathway analyses based upon controlled gene descriptions, such as those from the Kyoto Encyclopedia of Genes and Genomes (KEGG) or the Gene Ontology (GO) projects, are common practice (Ashburner et al. 2000; Goto et al. 1997). Like genome-wide association studies themselves, attempts to consider pathways are relatively new. The first study to explore pathway enrichment from genome-wide marker data was based upon relatively small samples on Parkinson’s disease and age-related macular degeneration (Wang et al. 2007). This was followed by a second paper based upon several human diseases from the larger public Wellcome Trust Case Control Consortium (WTCCC) data (Torkamani et al. 2008).
Pathway modeling has also been used as an additional analytical component in some of the primary genome-wide association studies, the first of which was a study of human height by Gudbjartsson et al. (2008). There are now several additional recent examples (Askland et al. 2009; Baranzini et al. 2009; Vink et al. 2009; Wang et al. 2009), and with the exception of one study on type 2 diabetes that reports negative findings (Perry et al. 2009), all previous studies have claimed significant evidence of pathway enrichment.
A principal bottleneck in these kinds of analyses, in contrast to gene expression profiling, is that genome-wide association studies produce genetic marker lists, not gene lists, and pathway annotations exist only for the latter. The reliable conversion of markers to representative genes is not trivial and shares the same inherent ambiguity as any genetic association study where a genomic region is implicated in disease or quantitative trait variation and where linkage disequilibrium (LD) remains a vital consideration. We thus set out to create an application that would allow the automatic conversion of any marker list in flat text file format to a representative gene list, while also taking LD into account. Our principal target was output from PLINK (Purcell et al. 2007), which is at present the most commonly used software for performing the statistical testing of markers from genome-wide association studies. In the present study, we have applied our software to several large genome-wide association data sets and used the results as a foundation to examine a number of issues that may have relevance for the success of this strategy. We obtained relatively strong evidence validating the importance of lipid-related pathways in the determination of plasma lipid levels, but we also highlight a key result: positional gene clustering, if not corrected for, can lead to spurious results.
Materials and methods
From SNP lists to Gene lists
A Perl program, named ProxyGeneLD, was developed to automate the assignment of genotyped SNPs from genome-wide association studies to specific genes, flexibly taking into consideration linkage disequilibrium (LD). This was based upon several assumptions. First, a functional SNP is more likely to affect the gene if the variant is located within its open reading frame or in the promoter binding region. Second, the number of SNPs derived from HapMap phase II is large enough to provide coverage of most genes annotated in the various databases describing pathway terms (e.g. Gene Ontology). Third, the program also assumes that the number of common non-HapMap SNPs that are solely located in a gene and in strong LD with a directly observed SNP is negligible.
The program requires five different data files from public databases, which were obtained as follows. The positional data of the entire set of validated polymorphic SNPs from CEU samples of HapMap phase II (release 22) and the file containing NCBI RefSeq transcript locations were obtained using the UCSC table browser tool (hg18). Information on pair-wise LD estimates for all HapMap CEU SNPs was downloaded from the official HapMap homepage (release 23a). In the program, gene definitions followed Entrez Gene database conventions and GeneIDs were used as reference identification numbers. Because GSEA-P, one of the downstream applications we utilized, accepts only gene symbols, a step to convert Entrez GeneIDs to gene symbols was also implemented in our program. The data files for GeneIDs and linked official gene symbols were downloaded from Entrez Gene ftp. All five data files were preprocessed into more compact files in order to speed up the program. The program begins by reading the output file of the original genome-wide association study (GWAS), a tab-, space-, or comma-delimited text file that contains a list of SNPs and their corresponding P values.
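As an illustration of the input step only, the following Python sketch reads tab-, space-, or comma-delimited lines into a table of SNP identifiers and P values. It is not the ProxyGeneLD code (which is written in Perl); the two-column layout (SNP first, P value second), the function name read_snp_pvalues, and the rs identifiers are assumptions made for the example.

```python
import re

def read_snp_pvalues(lines):
    """Parse delimited 'SNP P' lines into a dict (assumed two-column layout)."""
    snp_p = {}
    for line in lines:
        # Split on any run of commas, tabs, or spaces.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # blank or malformed line
        snp, p_text = fields[0], fields[1]
        try:
            p = float(p_text)
        except ValueError:
            continue  # header row or non-numeric entry such as 'NA'
        if 0.0 <= p <= 1.0:  # also rejects NaN, which fails both comparisons
            snp_p[snp] = p
    return snp_p

# Hypothetical input mixing the three delimiters.
example = ["SNP\tP", "rs0001\t1e-8", "rs0002,0.04", "rs0003 0.5", "rs0004 NA"]
pvals = read_snp_pvalues(example)
print(pvals)
```

Real PLINK output carries additional columns, so a practical reader would locate the SNP and P columns by header name rather than by position.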
Automated trimming of gene clusters
A gene cluster was defined as a gene set for which the most significant SNP assigned to each gene is a member of the same proxy cluster. In other words, signals from all genes in a gene cluster had a high likelihood of originating from a single association between one marker and the phenotype. For each gene, the program automatically provides a column showing any other gene(s) that were indiscernible members of the same cluster. As an optional step, it additionally creates a second list of genes in which clusters are resolved, whereby the gene with the highest ranked SNP is retained and the remaining genes are moved to the bottom of the list. A note of caution is that this function might not be appropriate for cases in which rank is important (e.g. for GSEA-P).
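The optional trimming step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published implementation: the input is assumed to be a gene list already sorted from most to least significant, each gene is paired with the identifier of the proxy cluster containing its most significant SNP (None if there is none), and the gene names, cluster labels, and the function name trim_gene_clusters are hypothetical.

```python
def trim_gene_clusters(ranked_genes):
    """Keep the top-ranked gene of each cluster in place; move the rest to the bottom.

    ranked_genes: list of (gene, cluster_id) tuples sorted best-first, where
    cluster_id is None for genes whose best SNP belongs to no proxy cluster.
    """
    seen_clusters = set()
    kept, moved = [], []
    for gene, cluster_id in ranked_genes:
        if cluster_id is not None and cluster_id in seen_clusters:
            moved.append(gene)  # a better-ranked gene already represents this cluster
        else:
            kept.append(gene)
            if cluster_id is not None:
                seen_clusters.add(cluster_id)
    return kept + moved

ranked = [("GENE_A", "c1"), ("GENE_B", "c1"), ("GENE_C", None),
          ("GENE_D", "c2"), ("GENE_E", "c1")]
trimmed = trim_gene_clusters(ranked)
print(trimmed)
```

Because the demoted genes lose their original positions, the output order no longer reflects significance for those genes, which is the reason for the caution about rank-based tools.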
Gene set enrichment analysis (GSEA v2.0) is implemented as a JAVA application named GSEA-P and requires a text file containing gene symbols and a weight for gene ranking. For our analyses, gene weights were calculated as weight = −log10(adjusted gene-wide P value). Since this produces a large number of negative values (the gene-based adjustment can push adjusted P values above 1), a linear scale ending at zero was created for all genes beyond the inflection in order to retain their relative rank positions. All reported results using GSEA are based upon 5,000 permutations, P = 1 weighting, a maximum set size of 500 and a minimum set size of 15. A false discovery rate <25% was used as an indication of potentially interesting findings.
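The weighting can be illustrated with a short Python sketch. The first part follows the formula in the text; the rescaling of the tail is only one possible reading of "a linear scale ending at zero", and is an assumption: genes with adjusted P ≥ 1 are placed, in rank order, on evenly spaced values that start just below the smallest positive weight and end at zero. The function name gsea_weights and the gene labels are hypothetical.

```python
import math

def gsea_weights(adjusted_p):
    """Return [(gene, weight)] sorted best-first from adjusted gene-wide P values.

    adjusted_p: dict gene -> adjusted P (> 0; may exceed 1 after adjustment).
    """
    ranked = sorted(adjusted_p.items(), key=lambda item: item[1])
    # Genes with adjusted P < 1 receive weight = -log10(adjusted P), which is positive.
    head = [(gene, -math.log10(p)) for gene, p in ranked if p < 1.0]
    tail = [gene for gene, p in ranked if p >= 1.0]
    # Assumed rescaling: spread the remaining genes linearly down to zero,
    # staying strictly below the smallest positive weight so ranks are preserved.
    start = head[-1][1] if head else 1.0
    n = len(tail)
    scaled = [(gene, start * (n - 1 - i) / n) for i, gene in enumerate(tail)]
    return head + scaled

weights = gsea_weights({"G1": 1e-6, "G2": 0.01, "G3": 2.0, "G4": 5.0})
for gene, weight in weights:
    print(gene, round(weight, 3))
```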
Threshold-based ontology analysis (DAVID)
DAVID is one of the many publicly available web applications for searching for over-represented pathways in gene lists. It provides extensive options for the interrogation of approximately 40 alternative databases (e.g. KEGG, Interpro, Pfam), but there is a large degree of redundancy when multiple data sources are tested, and GO has the greatest number of gene annotation records. Even for GO annotations, redundant terms arise with only marginal differences in the component gene lists. For this reason, all DAVID results presented here are based upon functional annotation clustering, and only the top five single terms from among clusters are reported. For presentation purposes, this was truncated at a maximum of five results. Analyses presented here used DAVID with a March 2008 GO annotation update.
Ingenuity pathway analysis (IPA; Ingenuity Systems)
IPA is a commercial web-delivered application implemented in JAVA, one function of which is to calculate the probability of observing an association between certain gene sets and pathways by random chance. It applies the right-tailed Fisher’s Exact test and the Benjamini–Hochberg method of multiple testing correction (Benjamini and Hochberg 1995) to find over-represented pathways in the gene set compared to a reference set, which in our analyses is the total set of genes monitored in each original GWAS. In IPA, annotations of genes to pathways are based on the Ingenuity Knowledge Base, which was built from findings extracted from major life sciences literature and from data in established public databases such as GO and EntrezGene.
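The two statistical steps named above can be sketched generically in Python. This is not IPA's code, which is proprietary; it is a plain implementation of a right-tailed Fisher's exact test (the upper tail of the hypergeometric distribution) and the Benjamini–Hochberg adjustment, with hypothetical function names and made-up counts.

```python
from math import comb

def fisher_right_tail(k, n, K, N):
    """P(overlap >= k) when n genes are drawn from N reference genes, K of them in the pathway."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 10 reference genes, 4 in the pathway, gene list of 3 with 2 hits.
p_example = fisher_right_tail(k=2, n=3, K=4, N=10)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(p_example, adjusted)
```

The choice of reference set (here N) matters: using only the genes monitored in the original GWAS, as described above, avoids crediting a pathway for genes that could never have been detected.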
For various analyses, gene length was taken as the average of all known splice-forms according to RefSeq of NCBI build 36.1. Total unadjusted SNP number for each gene was established according to the genomic interval spanning the longest known splice-form. Analyses involving gene ranks were performed using the Mann-Whitney U test. Gene length, number of markers, and unadjusted genome-wide P values were log10-transformed prior to simple linear regression.
The software ProxyGeneLD is available for free download at http://ki.se/ki/jsp/polopoly.jsp?d=26072&l=en. It consists of the main program proxyGeneLD.pl, and two additional programs for preprocessing.
DAVID 2008: http://david.abcc.ncifcrf.gov/; Entrez Gene download data: ftp://ftp.ncbi.nlm.nih.gov/gene/; Gene Ontology (GO): http://www.geneontology.org/; HapMap bulk data download, LD data: http://ftp.hapmap.org/ld_data/?N=D; UCSC genome browser: http://genome.ucsc.edu/; Ingenuity Pathway Analysis software: http://www.ingenuity.com/; GSEA-P: http://www.broad.mit.edu/GSEA
We developed software to automate the process of converting genome-wide SNP lists to gene lists, beginning with the retrieval of LD structures in analogous populations with denser genotyping data (i.e. HapMap). If a group of markers was in high LD in HapMap, we tied them to a “proxy cluster”, treating it as a single signal. Then, each marker in the original SNP list with statistically significant evidence of association with a phenotype was evaluated to see (a) if it belonged to any proxy cluster and (b) if the marker itself or any marker in the cluster was located in a genic region. Any marker or cluster that overlapped a region extending across a gene was assigned as a signal indicating the possible association of that gene. In order to correct for the multiple-testing problem that emerges due to multiple signals across a gene, the P value for each gene was adjusted by multiplying the lowest P value of the assigned signals by the number of signals. A more precise description of the algorithm is presented in the methods section and an example is illustrated in Fig. 1.
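The assignment and adjustment logic described in this paragraph can be sketched in Python as follows. This is a simplified illustration, not the ProxyGeneLD implementation: the proxy clusters and the SNP-to-gene overlaps are assumed to have been computed beforehand from HapMap LD data and gene coordinates, and all names (gene_adjusted_p, the rs identifiers, the cluster label, the gene labels) are hypothetical.

```python
from collections import defaultdict

def gene_adjusted_p(snp_p, snp_cluster, cluster_members, snp_genes):
    """Adjusted gene P = lowest signal P among assigned signals x number of signals.

    snp_p:           GWAS SNP -> P value
    snp_cluster:     SNP -> proxy cluster id (absent if the SNP is unclustered)
    cluster_members: cluster id -> all SNPs in the cluster (may include HapMap-only SNPs)
    snp_genes:       SNP -> set of genes whose region the SNP overlaps
    """
    # Each proxy cluster is one signal; an unclustered SNP is its own signal.
    signal_p = {}
    for snp, p in snp_p.items():
        signal = snp_cluster.get(snp, snp)
        signal_p[signal] = min(p, signal_p.get(signal, 1.0))

    # A signal is assigned to a gene if any of its member SNPs lies in the gene.
    gene_signals = defaultdict(dict)
    for signal, p in signal_p.items():
        for member in cluster_members.get(signal, [signal]):
            for gene in snp_genes.get(member, ()):
                gene_signals[gene][signal] = p

    return {gene: min(sig.values()) * len(sig) for gene, sig in gene_signals.items()}

snp_p = {"rs1": 1e-6, "rs2": 1e-4, "rs3": 0.02}
snp_cluster = {"rs1": "c1", "rs2": "c1"}
cluster_members = {"c1": ["rs1", "rs2", "rs9"]}  # rs9: untyped proxy SNP
snp_genes = {"rs9": {"GENE_A"}, "rs2": {"GENE_B"}, "rs3": {"GENE_A"}}
adjusted_genes = gene_adjusted_p(snp_p, snp_cluster, cluster_members, snp_genes)
print(adjusted_genes)
```

In this toy example GENE_A receives two signals (the cluster, through its proxy SNP, and rs3), so its lowest P value is doubled, while GENE_B receives one. Because the adjustment is a simple multiplication, adjusted values can exceed 1 for genes with many weak signals.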
There were three core issues that we considered essential to investigate prior to attempting pathway enrichment analyses on disease or trait results from genome-wide association studies: (1) the extent of bias introduced by the higher likelihood of low P values in long genes (or genes with many tested polymorphisms); (2) the potential local clustering of genes with similar function; and (3) differences in marker lists, in particular as they relate to imputed vs. non-imputed data sets (Marchini et al. 2007).
[Table: Genome-wide association studies. Contents not recovered.]
[Table: GO terms enriched among long genes. Columns: Genes with term; HDL trait P value. Terms listed: Plasma membrane part; Nervous system development. Values not recovered.]
[Table: GO terms enriched for plasma HDL levels. Column: Genes with term. Terms listed: Response to external stimulus; Fatty acid metabolic process; Cellular lipid metabolic process. Sub-table: specific genes contributing to enrichment of the term “cellular lipid metabolic process” (Rank in GWAS; P value (original)). Values not recovered.]
[Table: GO terms enriched for plasma LDL levels. Column: Genes with term. Terms listed: Lipid transporter activity; Nucleic acid metabolic process; Macromolecular complex assembly. Sub-table: specific genes contributing to enrichment of the term “sterol transport” (Rank in GWAS; P value (original)). Values not recovered.]
[Table: GO terms enriched for type 2 diabetes. Column: Genes with term. Terms listed: Response to nutrient; Advanced glycation end-product receptor activity; Cellular component assembly. Sub-table: specific genes contributing to enrichment of the term “microtubule-based process” (Rank in GWAS; P value (original)). Values not recovered.]
[Table: GO terms enriched for Crohn’s disease. Columns: P value pre-trim; P value post-trim (genes); Fold change post-trim. Terms listed: MHC protein complex; Regulation of apoptosis. Sub-table: specific genes contributing to enrichment of the term “immune response” after trimming (Rank in GWAS; P value (unadjusted)). Values not recovered.]
[Table: GO term enrichment for HDL from GSEA analyses. Columns: P value (sample 1); P value w/o CETP. Term listed: Lipid transporter activity. Sub-table: specific genes contributing to enrichment of the term “lipid transport” (Rank in GWAS; Rank Metric Score). Values not recovered.]
Finally, we sought evidence of replication for the above analyses by focusing on a second independent sample with plasma lipid traits from a recent genome-wide association study (Table 1; sample 4). We decided to focus on IPA for this analysis, given its somewhat better apparent performance compared to the other approaches above. We were unable to validate the result for LDL levels, with no functional categories with three or more genes achieving P < 0.05. However, for HDL levels, results were highly consistent with the data from sample 1, with multiple categories replicating between the two sets and with “lipid metabolism” being the most significant high-level category in both samples. The best evidence of replication was obtained for the lower-level term “homeostasis of cholesterol” with P = 3.3 × 10^−4 for sample 1 and P = 5.3 × 10^−4 for sample 4 (using Fisher’s combined probability test gave P = 2.3 × 10^−6). This represented the best study-wide significance across all tests conducted.
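For reference, Fisher's combined probability test can be written in a few lines of Python. The sketch below is a generic implementation, not the code used for the analysis above; the function name fisher_combined_p is hypothetical. It uses the fact that the statistic X = −2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom under the null, and that the upper tail of a chi-square with an even number of degrees of freedom has a closed form. Applied to the two rounded P values quoted above, it returns a value of the same order of magnitude as the reported combined P; any small difference would be expected from rounding of the inputs.

```python
import math

def fisher_combined_p(pvalues):
    """Combine independent P values with Fisher's method (chi-square, 2k df)."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # X / 2, where X = -2 * sum(ln p)
    # Upper tail of chi-square with 2k df: exp(-X/2) * sum_{j<k} (X/2)^j / j!
    series = sum(half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * series

combined = fisher_combined_p([3.3e-4, 5.3e-4])
print(f"{combined:.2e}")
```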
We developed and implemented a new software program to facilitate the conversion of genetic marker lists to gene lists, with the primary goal of uncovering evidence of pathway enrichment from genome-wide association studies. To accomplish this, only the largest of the many available genome-wide association study data sets were targeted, with an initial focus on plasma lipid levels that we thought might have a detectable pathway basis. We obtained a reasonably high level of statistical evidence by replicating an enrichment of lipid-related terms for plasma HDL levels in two independent samples. We consider, however, the more important aspect of this work to be the broader exploration of various parameters that may affect the validity of this strategy.
A central issue in approaches such as ours is how to best obtain representation of genes from markers. Our study is distinct from previous efforts to apply pathway approaches to genome-wide data, in that we treat gene representation as a core issue. For example, the original study by Wang et al. lists only the assignment of markers within 500 kb of genes, without further detail (Wang et al. 2007). The second study by Torkamani et al. provides more information, linking SNPs to genes within 5 kb, but there they invoked a hierarchy giving more weight to coding SNPs and they did not specifically address the issue of LD (Torkamani et al. 2008). Regardless, any strategy like the one presented here will both increase gene diversity in a region of a single association signal, potentially causing pathway dilution, and create the possible “appearance” of enrichment due to the positional clustering of functionally related genes. For the former, we noted that across the various data sets, the inflation factor was around 0.5, or on average 50% more genes than would be expected if LD was r^2 ≥ 0.8 between the markers showing the highest significance. The consequence of this will be to increase type II error. The latter problem, however, in our view, is much more serious, and a few cases are exemplified where the statistical support decreases following correction for clustering. Although we attempt to resolve clustering with manual and/or automatic curation, the only real solution will require the identification of true functional variants that can be shown to influence specific genes. Thus, functional studies on gene regulation, including the identification of markers that directly affect gene expression and can be tied to a specific gene, are extremely important for pathway-based analyses. There are a number of impressive in silico tools now appearing for marker and gene prioritization that may also aid this endeavor (Chen et al. 2008; Gaulton et al. 2007; Ge et al. 2008; Pico et al. 2009; Tranchevent et al. 2008). The many published genome-wide data sets also lend themselves to strategies to test for marker independence (Sun et al. 2008), but there will also be merit in exploring alternative human populations, where LD is less of a problem (Cox et al. 2002). For the time being, we acknowledge that our algorithm leads to the inclusion of an excess of genes around significant signals, but we do note that the software includes options for changing LD thresholds. Nonetheless, in terms of pathway analyses it may be argued that results of enriched terms where the contributing genes all reside in independent loci may be regarded as more reliable than those where the genes are clustered, even if not in strict LD.
We think there is reason to be cautious about biological interpretations of pathway results across the data sets here and elsewhere. The space of variables that can be tuned for these kinds of analyses is quite large, ranging from the statistics and covariates chosen in the original genome-wide studies to which pathway enrichment tool/statistic is used. This is particularly important given the relatively large number of analysis programs now available, including those used here (DAVID, GSEA-P, and IPA). Against this background, the possible enrichment of terms associated with lipid metabolism in two independent data sets for HDL levels represents a reasonable validation of this approach. However, this statistical evidence should be taken in context with that of the most significant individual gene (CETP; P ≈ 10^−20). If this is an indication that the evidence one can expect from pathway analyses is likely to be weaker than for individual genes, then even larger samples than those currently employed may be required for detection. An interesting contrast to the HDL result is the essential lack of evidence in the Crohn’s disease sample. In particular, the role of the immune system for this disease is well documented (Lettre and Rioux 2008) yet is only marginally implicated by our pathway analysis. The result, however, reiterates a classic problem in genetics, namely the possibility that there are multiple independent functional variants acting in the HLA region that are indiscernible due to LD. This may have implications for other immune diseases, such as rheumatoid arthritis (Raychaudhuri et al. 2008). The result for diabetes yielded yet a third contrasting scenario, where there was a suggestion of the involvement of microtubule genes but no confounding by positional clustering. An intriguing aspect of this result was the inclusion of the KIF11 gene, which resides in an LD block that includes HHEX and IDE. We have focused on IDE in the past (Gu et al. 2004), and others have highlighted maximum signals nearer HHEX (Zeggini et al. 2007). Still, though there is no strong precedent in the literature for the involvement of kinesin proteins in type 2 diabetes, the additional information included by a pathway analysis may help to prioritize specific genes for further functional assessment.
At the outset of this investigation we were interested in what the evolutionary implications might be should evidence of pathway enrichment emerge. Given that both allelic and locus heterogeneity act to influence phenotypes, a natural concern was that population differences could contribute to apparent enrichment. Thus, two genes with similar annotations and acting detectably on a phenotype but exclusively in separate sub-populations would be considered together. For genes such as CETP this is unlikely to be the case, since it has been demonstrated to be highly significant in numerous populations, but for genes farther down in rank it may be. This ultimately represents a trade-off between obtaining high statistical power for single marker analyses (via meta-analysis) and the risk of enriching a pathway due to locus heterogeneity across different populations. Nonetheless, the assumption that diseases should arise by mechanisms involving similar genes has long been one of the cornerstones of genetic association studies. Complex phenotypes have evolved to be resilient to insult (Buchanan et al. 2006), and the emergence of modest to weak effects across multiple genes in a pathway might be seen to support this.
In summary, pathway analyses of genome-wide association data based upon controlled gene annotations are now emerging, with many groups claiming evidence of enrichment for a range of phenotypes (Baranzini et al. 2009; Torkamani et al. 2008; Wang et al. 2007). The data presented in this paper also support the prospect that biological pathways can be detected in genome-wide association data, but we have still only scratched the surface of the bulk of genome-wide data becoming available, and we think there is still reason to be cautious. In particular, the observation that persists both here and in other studies is that there is no case where the evidence for any pathway enrichment exceeds that of the previously reported single locus findings. This stands in strong contrast to gene expression studies, where striking evidence of pathway enrichment often emerged in the absence of obvious single gene effects (Mootha et al. 2003). The application of pathway approaches to genetic marker data is nonetheless relatively new territory and may represent a valuable addition to ongoing genome-wide studies given the vast range of phenotypes now being investigated (http://www.genome.gov/gwastudies/). This needs to be tempered, however, against a number of issues that have previously not been dealt with, the most important in our view being positional gene clustering. The software provided here should nonetheless help to relieve the bottleneck of automating the transition from marker lists to gene lists, and thus expedite further analyses of genome-wide data in a pathway context.
We are greatly indebted to the scientists responsible for making their genome-wide data accessible. We are grateful for financial support from The Swedish Medical Research Council (grant 2007-2722) and the National Institutes of Health (grant AG028555).
Conflict of interest statement
The authors declare no conflict of interest.