Human Genetics

, Volume 126, Issue 2, pp 289–301

Strategies and issues in the detection of pathway enrichment in genome-wide association studies

Authors

  • Mun-Gwan Hong
    • Department of Medical Epidemiology and BiostatisticsKarolinska Institutet
  • Yudi Pawitan
    • Department of Medical Epidemiology and BiostatisticsKarolinska Institutet
  • Patrik K. E. Magnusson
    • Department of Medical Epidemiology and BiostatisticsKarolinska Institutet
    • Department of Medical Epidemiology and BiostatisticsKarolinska Institutet
Original Investigation

DOI: 10.1007/s00439-009-0676-z

Cite this article as:
Hong, M., Pawitan, Y., Magnusson, P.K.E. et al. Hum Genet (2009) 126: 289. doi:10.1007/s00439-009-0676-z

Abstract

A fundamental question in human genetics is the degree to which the polygenic character of complex traits derives from polymorphism in genes with similar or with dissimilar functions. The many genome-wide association studies now being performed offer an opportunity to investigate this, and although early attempts are emerging, new tools and modeling strategies still need to be developed and deployed. Towards this goal, we implemented a new algorithm to facilitate the transition from genetic marker lists (principally those generated by PLINK) to pathway analyses of representational gene sets in either threshold or threshold-free downstream applications (e.g. DAVID, GSEA-P, and Ingenuity Pathway Analysis). This was applied to several large genome-wide association studies covering diverse human traits that included type 2 diabetes, Crohn’s disease, and plasma lipid levels. Validation of this approach was obtained for plasma HDL levels, where functional categories related to lipid metabolism emerged as the most significant in two independent studies. From analyses of these samples, we highlight and address numerous issues related to this strategy, including appropriate gene based correction statistics, the utility of imputed versus non-imputed marker sets, and the apparent enrichment of pathways due solely to the positional clustering of functionally related genes. The latter in particular emphasizes the importance of studies that directly tie genetic variation to functional characteristics of specific genes. The software freely provided that we have called ProxyGeneLD may resolve an important bottleneck in pathway-based analyses of genome-wide association data. This has allowed us to identify at least one replicable case of pathway enrichment but also to highlight functional gene clustering as a potentially serious problem that may lead to spurious pathway findings if not corrected.

Introduction

The extent to which common genetic polymorphism contributes to variance in complex human traits has been explored for decades but it has only recently become possible to perform hypothesis-free studies at fine scale on a genome-wide level (Klein et al. 2005). These studies strive to reveal the underlying genetic architecture of complex human diseases and quantitative traits and although the genetic effect sizes have typically been small, the statistical evidence implicating individual genes has been strong (Barrett et al. 2008; Willer et al. 2008; Zeggini et al. 2008). However, most genome-wide association studies performed to date depict only a few significant loci (http://www.genome.gov/gwastudies/). Among the many remaining questions is whether or not data from these studies can also be used to generate additional insight into the biological pathways that influence disease. If the suggestion that complex trait genetics follows an L-shaped distribution of effect sizes is true (Dixon et al. 2007; Sing and Boerwinkle 1987), then it might be possible to explore the rightmost tail of the distribution for significant genes that share common functions. This represents a natural extension of the approach used in gene expression profiling, where pathway analyses based upon controlled gene descriptions, such as from the Kyoto Encyclopedia of Genes and Genomes (KEGG) or the Gene Ontology (GO) projects, is common practice (Ashburner et al. 2000; Goto et al. 1997). As reflected by the recentness of genome-wide association studies themselves, attempts to consider pathways are also relatively new. The first study to explore for pathway enrichment from genome-wide marker data was based upon relatively small samples on Parkinson’s disease and age-related macular degeneration (Wang et al. 2007). This was followed up by a second paper based upon several human diseases from the larger public Wellcome-Trust Case-Control Consortium data (WTCCC) (Torkamani et al. 2008). The use of pathway modeling has also been an additional analytical component to some of the primary genome-wide association studies, the first of which being a study of human height by Gudbjartsson et al. (2008). There are now several recent additional examples (Askland et al. 2009; Baranzini et al. 2009; Vink et al. 2009; Wang et al. 2009), and with the exception of one study on type 2 diabetes that reports negative findings (Perry et al. 2009), all previous studies have claimed significant evidence of pathway enrichment.

A principal bottleneck in these kinds of analyses, in contrast to gene expression profiling, is that genome-wide association studies produce genetic marker lists, not gene lists, and for the latter, pathway annotations exist. The reliable conversion of markers to representative genes is not trivial and shares the same inherent ambiguity as any genetic association study where a genomic region is implicated in disease or quantitative trait variation and where linkage disequilibrium (LD) remains a vital consideration. We thus set out to create an application that would allow the automatic conversion of any marker list in flat text file format to a representative gene list, also taking into account LD. Our principal target was output from PLINK (Purcell et al. 2007) which is at present the most commonly used software for performing the statistical testing of markers from genome-wide association studies. In the present study, we have applied our software to several large genome-wide association data sets and used the results as a foundation to examine a number of issues that may have relevance for the success of this strategy. We obtained relatively strong evidence validating the importance of lipid related pathways in the determination of plasma lipid levels, but we also highlight a key result that positional gene clustering, if not corrected for, can lead to spurious results.

Materials and methods

From SNP lists to Gene lists

A Perl program, named ProxyGeneLD, was developed to automate the assignment of genotyped SNPs from genome-wide association studies to specific genes, flexibly taking into consideration linkage disequilibrium (LD). This was based upon several assumptions. First, a functional SNP is more likely to affect the gene if the variant is located within its open reading frame or in the promoter binding region. Second, the number of SNPs derived from HapMap phase II is large enough to provide coverage of most genes annotated in the various databases describing pathway terms (e.g. Gene Ontology). Third, the program also assumes that the number of common non-HapMap SNPs that are solely located in a gene and in strong LD with a directly observed SNP is negligible.

The program requires five different data files from public databases, which were obtained as follows. The positional data of the entire set of validated polymorphic SNPs from CEU samples of HapMap phase II (release 22) and the file containing NCBI RefSeq transcript locations were attained using the UCSC table browser tool (hg18). Information on pair-wise LD estimates for all HapMap CEU SNPs was downloaded from the official HapMap homepage (release 23a). In the program, gene definitions followed Entrez Gene database conventions and GeneIDs were used as reference identification numbers. Because GSEA-P, one of the downstream applications we utilized, accepts only gene symbols, a step to convert Entrez GeneIDs to gene symbols was also implemented in our program. The data files for GeneIDs and linked official gene symbols were downloaded from Entrez Gene ftp. All of the aforementioned five data files were preprocessed creating more compact files in order to speed up the program. The program initiates by reading the output data file of original genome-wide association study (GWAS) that is tab-, space-, or comma-delimited text files and which contains list of SNPs and corresponding P values.

Two sets of SNPs are considered in the program. The first is the set of SNPs from HapMap CEU phaseII and the other comprises the SNPs tested in original GWAS, which were termed “HapMap SNPs” and “study SNPs”, respectively. The program first examines LD structure between HapMap SNPs and groups two HapMap SNPs at a user defined LD threshold to create a “proxy cluster”. In most cases, the condition r2 ≥ 0.8 was used as a default. The proxy cluster is then expanded by iteratively adding one more HapMap SNP in high LD with any member SNP of the cluster or by merging with another proxy cluster that shares an identical member. The program retains information on which HapMap SNPs were in high LD to form proxy clusters. Each HapMap SNP regardless of membership in a proxy cluster is then assigned to the longest known transcript of each gene including a 1 kbp extension to include promoter regions. Likewise, every proxy cluster is assigned to the nearest gene according to the positions of the SNPs that comprise the cluster. All genes for which any study SNP or any proxy cluster is assigned are then listed, ignoring SNPs which are not represented in HapMap. The significance level of association for each gene before adjustment is chosen as the lowest P value among the single study SNPs that do not belong to any proxy cluster and those in proxy clusters that include one or more study SNP that have been assigned to the gene. The total number of single SNPs and proxy clusters is then multiplied by the pre-adjustment significance level. An illustration of this process is shown in Fig. 1.
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-009-0676-z/MediaObjects/439_2009_676_Fig1_HTML.gif
Fig. 1

An example of how the presented software computes adjusted gene-wide P values. Eight SNPs, including two that are unobserved with known LD are assumed to overlap with two genes. Out of eight markers spanning the region, there are three in tight LD on one side, creating proxy cluster 1, and four in tight LD on the other side, creating proxy cluster 2. The P value for Gene 1 is adjusted to the minimum P value of the markers in proxy cluster 1, since only one cluster spans the gene (min(0.02, 0.01, 0.02) × 1 = 0.01)). Gene 2 overlaps both proxy clusters 1 and 2, in addition to a single isolated SNP. Among the markers in proxy cluster 1, the only one that is directly genotyped is located in the region spanning Gene 2. Therefore, the P value for Gene 2 is calculated by the formula, min(min(0.02), 0.25, min(0.1, 0.09)) × 3 = 0.06

Automated trimming of gene clusters

A gene cluster was defined as a gene set for which the most significant SNP assigned to each gene is a member of the same proxy cluster. In other words, signals from all genes in a gene cluster had a high likelihood of originating from a single association between one marker and phenotype. The program automatically provides a column showing for each gene any other gene(s) that were indiscernible members of the same cluster. As an optional step, it additionally creates one more list of genes in which clusters are sorted out, whereby the gene with the highest ranked SNP is retained and the remaining genes moved to bottom of the list. A note of caution with this is that it might not be appropriate to use this function for cases in which rank is important (e.g. for GSEA-P).

GSEA

Gene set enrichment analysis (GSEA v2.0) is implemented as a JAVA application named GSEA-P and requires a text file containing gene symbols and a weight for gene ranking. For our analyses, gene ranks were calculated by; (weight) = −log10(adjusted gene-wide P value). Since this produces a large number of negative values due to gene-based adjustment, to retain the relative rank positions for all genes beyond the inflection a linear scale ending at zero was created. All reported results using GSEA are based upon 5,000 permutations, P = 1 weighting, maximum set size of 500 and minimum set size of 15. A false discovery rate <25% was used as an indication of potentially interesting findings.

Threshold-based ontology analysis (DAVID)

DAVID is one of the many publicly available web-applications for searching for over-represented pathways in gene lists. It provides extensive options for the interrogation of approximately 40 alternative databases (e.g. KEGG, Interpro, Pfam), but there is a large degree of redundancy when multiple data sources are tested and GO has the greatest number of gene annotation records. Even for GO annotations, redundant terms arise with only marginal differences in the component gene lists. For this reason, for all analyses using DAVID results are presented based upon the use of functional annotation clustering and only the top five single terms from among clusters are reported. For presentation purposes, this was truncated at a maximum of five results. Analyses presented here used DAVID with a March 2008 GO annotation update.

Ingenuity pathway analysis (IPA; Ingenuity Systems)

IPA is a commercial web-delivered application implemented in JAVA of which one of functions is to calculate the probability of observing an association between certain gene sets and pathways by random chance. It applies the right-tailed Fisher’s Exact test and Benjamini–Hochberg method of multiple testing correction (Benjamini and Hochberg 1995) to find over-represented pathways in the gene set comparing to a reference set, which in our analyses is the total genes monitored in each original GWAS. In IPA, annotations of genes to pathways are based on the Ingenuity Knowledge Base, that was built on the extracted findings in major life sciences literature and data in established public databases such as GO and EntrezGene.

General statistics

For various analyses, gene length was taken as the average of all known splice-forms according to RefSeq of NCBI build 36.1. Total unadjusted SNP number for each gene was established according to the genomic interval spanning the longest known splice-form. Analyses involving gene ranks were performed using the Mann-Whitney U test. Gene length, number of markers, and unadjusted genome-wide P values were log10-transformed prior to simple linear regression.

Software availability

The software ProxyGeneLD is available for free download at http://ki.se/ki/jsp/polopoly.jsp?d=26072&l=en. It consists of the main program proxyGeneLD.pl, and two additional programs for preprocessing.

URLs

DAVID 2008: http://david.abcc.ncifcrf.gov/; Entrez Gene download data: ftp://ftp.ncbi.nlm.nih.gov/gene/; Gene Ontology (GO): http://www.geneontology.org/; HapMap bulk data download, LD data: http://ftp.hapmap.org/ld_data/?N=D; UCSC genome browser: http://genome.ucsc.edu/; Ingenuity Pathway Analysis software: http://www.ingenuity.com/; GSEA-P: http://www.broad.mit.edu/GSEA

Results

We developed software to automate the process of converting genome-wide SNP lists to gene lists, beginning with the retrieval of LD structures in analogous populations with denser genotyping data (i.e. HapMap). If a group of markers was in high LD in HapMap, we tied them to a “proxy cluster” treating it as a single signal. Then, each marker in the original SNP list with statistically significant evidence of association with a phenotype was evaluated to see (a) if it belonged to any proxy cluster and (b) if the marker itself or any marker in the cluster was located in a genic region. Any marker or cluster that overlapped a region extending across a gene was assigned as a signal indicating the possible association of that gene. In order to correct the multiple-testing problem that emerges due to multiple signals across a gene, the P value for each gene was adjusted by multiplication of the lowest P value of the assigned signals by the number of signals. A more precise description of the algorithm is presented in the methods section and an example is illustrated in Fig. 1.

There were three core issues that we considered essential to investigate prior to attempting pathway enrichment analyses on disease or trait results from genome-wide association studies (1) the extent of bias introduced by the higher likelihood of low P values in long genes (or genes with many tested polymorphisms) (2) the potential local clustering of genes with similar function and (3) differences in marker lists, in particular as it relates to imputed vs. non-imputed data sets (Marchini et al. 2007).

For the first question, the impact of gene length was assessed using the gene list derived from an HDL data set (Table 1; sample 1), in order to provide an indication of what might be expected from genome-wide marker data rather than all annotated genes in the human genome. We note that this sample consists of approximately 2.3 million directly typed and imputed markers, and thus gene representation reflects what is obtainable from HapMap data (release 21). After conversion using our algorithm, this set also resulted in the largest number of genes (16,977) from among the samples listed in Table 1. These were sorted by average transcript length of known splice-forms (see “Methods”) and the top 1, 2, and 3% of genes were tested by enrichment analysis using DAVID with the total (16,977) as the base set (see “Methods”). All three divisions showed a large enrichment of specific ontology terms with progressively increasing significance from the 1–3% groupings (for clarity we only show the 3% grouping and the top 5 terms in Table 2). Sorting this gene set by the number of SNPs extending across the longest splice-form of each gene resulted in a similar result (Table 2). Finally, assessing genes according to their unadjusted P value for the HDL trait association itself also gave rise to similar term enrichment, although significance was attenuated (Table 2). Simple linear regression of gene length, number of markers, and unadjusted marker significance was as expected high (r2 = 0.8 for length vs. marker number, and r2 = 0.15 for both comparisons of unadjusted P value vs. length and marker number, P ≪ 0.0001 for all three comparisons). These results illustrate the importance of applying strict corrections for gene length (or number of markers), and for this reason, we elected to base our correction upon an LD-adjusted number of markers for each gene (see “Methods”). A related topic is the change in the rank of associated genes following the weighted adjustment by our algorithm. This is shown in Fig. 2, where the ranks of the top 3% of adjusted genes for the HDL trait are plotted against their ranks prior to adjustment (there we also depict a running average of the rank ratios).
Table 1

Genome-wide association studies

Sample

Trait(s)

Size

Markersa

Genesb

References

Sample 1

HDL, LDL, TG

8,816

2.56M

16977

Willer et al. (2008)

Sample 2

Type II diabetes

9,562

2.20M

16128

Zeggini et al. (2008)

Sample 3

Crohn’s disease

8,059

0.63M

16525

Barrett et al. (2008)

Sample 4

HDL, LDL, TG

1,754

0.32M

16934

Aulchenko et al. (2009)

aTotal number of typed and/or imputed markers used for gene list creation

bNumber of genes represented after marker conversion in the meta-analysis data sets

Table 2

GO terms enriched among long genes

Term

Genes with term

Length

SNP number

HDL trait P value

Synapse

37

4.80E-17

1.00E-16

NS

Cell adhesion

61

4.00E-13

2.60E-14

1.00E-06

Plasma membrane part

106

2.20E-12

1.30E-10

2.90E-07

Membrane

254

3.10E-10

1.50E-11

4.30E-05

Nervous system development

52

4.90E-09

5.40E-11

NS

Enriched terms among the longest 3% of genes, by number of markers, and by unadjusted association with the HDL trait. Genes in term refers only to the number in the first analysis of length

NS not significant at α = 0.05

https://static-content.springer.com/image/art%3A10.1007%2Fs00439-009-0676-z/MediaObjects/439_2009_676_Fig2_HTML.gif
Fig. 2

Change in the ranks of genes in sample 1 (GWAS of plasma HDL levels) following gene based adjustment for the number of markers spanning the longest known splice-form. Unadjusted ranks are plotted along the x-axis and adjusted ranks along the y-axis. The sub-panel shows the running average of the ratio between the unadjusted and adjusted ranks and illustrates that the relative change is large for the top-ranked genes. The appearance of lines in the main panel reflects groups of genes with similar numbers of markers (the right-most being genes represented by single markers, the second right-most by 2 markers, and so on)

The second question pertaining to the local clustering of genes with similar functions was in our view a potentially complicated problem. This has been explored to some extent previously (e.g. Lee and Sonnhammer 2003; Yi et al. 2007), but we thought it might be valuable to generate a depiction of the degree of clustering in the genome based upon gene ontology annotations. For this, we again used the gene list derived from the HDL data set (Table 1; sample 1) and a systematic walk was performed along each chromosome from pter to qter, taking 100 genes at a time (this resulted in an average spacing of 17 megabases and a total of 163 bins) and those lists tested for enrichment using DAVID. Gene lists were expanded to 100 < N < 200 at the qter tails. We recognized that the true problem resides in the capturing of multiple genes by virtue of linkage disequilibrium (LD) with a single association signal, but that LD which seldom extends beyond a few hundred kilobases (Weiss and Clark 2002) would result in gene lists that are too small to empower an enrichment analysis. An equivalent analysis was performed using 100 random bins of 100 genes each to generate a null distribution for comparison. In each case, the maximum statistic and associated GO term were recorded and we illustrate the results in Fig. 3 (full data ordered by chromosome and segment interval as well as specifically enriched terms are presented in supplementary Table 1). This revealed that clustering is extensive, with a large proportion of bins showing strong evidence of pathway enrichment. We scrutinized the ten most significant bins to assess the size of the chromosomal segments and the LD structure across the genes that contribute to the signals. In general, the clusters giving rise to significance spanned single contiguous regions much smaller than the total region examined (not shown). However, there were exceptions to this. For example, olfactory genes located on chromosome 11 occur in multiple, separate clusters.
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-009-0676-z/MediaObjects/439_2009_676_Fig3_HTML.gif
Fig. 3

Evidence of wide-spread ontology-based clustering of genes in the human genome. The main panel illustrates the distribution of significance for the first 100 clusters (dark shaded area) against a null distribution for random genes (light shaded area at base). The sub-panel depicts significant clusters according to genome position, ordered from pter of chromosome 1 to qter of chromosome 22. The top three peaks represent “olfactory receptor activity”, “cell adhesion”, and “keratinization”, respectively. See supplementary Table 1 for a full listing of significant terms

To explore how imputation (Marchini et al. 2007) might impact pathway analyses, we examined the WTCCC diabetes data set (Table 1; subset of sample 2), for which both directly genotyped and imputed data were available. After marker conversion, our algorithm resulted in 15,510 genes being represented in the directly typed set (from 393,143 markers) versus 16,347 in the imputed set (from 2,308,536 markers). We noted that the average length of the represented genes from the imputed data were much shorter than those in the non-imputed list (8,995 bases for the 837 additional genes vs. 63,591 bases for the non-imputed set, P ≪ 0.0001). The position of the imputed genes by unadjusted P value rank for the disease association in the full list was also assessed to determine if these were randomly distributed. Average rank for the imputed genes was 11,812 versus 7,977 for the non-imputed set, P ≪ 0.0001. We consider the more important result here to be the inclusion by imputation of short genes, as this last statistic related to rank simply reflects the smaller likelihood of obtaining low P values for short genes. A second question is the relative change in the ranks of genes following imputation. To estimate this, the ranks of the top 3% of genes in the directly typed set were plotted versus their ranks following imputation and the same was done for the top 3% of imputed genes versus the typed set. This is shown in Fig. 4, where the ranks of the majority of genes remain the same, but where clear outliers exist. Of particular note, while the increase following imputation for some genes was expected (Fig. 4a), the loss of rank (Fig. 4b) derives from our algorithm using LD to predict significant un-typed markers in adjacent genes, but where imputation ultimately provides evidence against those particular genes is being significant. One additional analysis was thus conducted, changing the LD threshold to 0.9 from its default of 0.8. This resulted in slightly fewer genes that decreased rank following imputation, but did not affect the number of genes that increased rank (not shown).
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-009-0676-z/MediaObjects/439_2009_676_Fig4_HTML.gif
Fig. 4

Change in gene ranks for a subset of sample 2 (WTCCC type 2 diabetes) using the top 3% of genes from directly genotyped marker lists (x-axis) compared to the same genes derived from imputed lists (y-axis) (a). For comparison, in b, the top 3% of genes from imputed lists (x-axis) are compared to directly typed markers (y-axis). In both cases, while some extreme outliers occur, most genes retain similar rank. All axes are drawn in log10 scale

To initiate pathway analyses of real association results, the HDL set (Table 1; sample 1) was selected with the premise that this might represent a validation of this approach and provide a benchmark for the degree of statistical evidence to be expected. Thus, the emerging genetic architecture of the HDL trait indicates that several of the most significant genes previously implicated on a genome-wide level play a role in lipid metabolism (e.g. Zeggini et al. 2008). The data set comprises a meta-analysis of 8,656 individuals from 3 individual studies, with results representing the marker-by-marker test statistics for 2.3 million markers (directly typed and imputed). We performed a marker-to-gene conversion for this sample and analyzed the top 1% of genes for pathway enrichment (see “Methods”). This threshold was decided upon primarily due to the concern that larger lists might increase the likelihood of enriching for functional gene clusters. From this analysis, we noted that there were at least two possible instances of exaggerated enrichment due to clustering. In the first, three genes contribute to statistical evidence of enrichment of several lipid related terms, these being FADS1, FADS2, and FADS3, all residing in the same genomic interval and LD block. Removing two of these from the input list, retaining only the gene with the highest rank, and re-evaluating the statistics for previously significant terms resulted in an anticipated decrease in the statistical evidence. In contrast, we also noted two genes on chromosome 16q, in relatively tight proximity, that both contribute the enrichment of the term “cellular lipid metabolic process”, these being LCAT and LYPLA3. The most significant markers in these genes were not in high LD, and thus both were retained. Based on these observations, we concluded that while it was trivial and informative to manually curate the results of a pathway analysis by examining the specific genes that contribute to enrichment and to adjust the number downward based upon LD and retaining the gene(s) with maximum significance, that an automated process might be preferable in some cases. For this reason, we added a feature to our software to automatically generate “trimmed” lists to account for genes in high LD (see “Methods”). This feature also allowed us to provide an estimate of the average level of excess gene representation across the entire data set. For this sample that number was 0.53; in other words, each gene had on average 0.53 additional genes that were indiscernible given a certain LD threshold (in this case r2 = 0.8). The results of the manually trimmed analysis of the HDL trait are presented in Table 3. There, the genes that contribute to the enrichment of the term “cellular lipid metabolic process” are specifically highlighted since they served to illustrate the rank assignments from our algorithm of some of the most well known genes that influence HDL levels (Table 4). We extended this analysis to both LDL and triglyceride levels. The results for LDL are shown in Tables 5 and 6, where lipid related terms also emerged at the top. In contrast, for TG, there were no significant terms, even at a liberal uncorrected P value threshold of 0.05.
Table 3

GO terms enriched for plasma HDL levels

Term

Genes with term

P value

Fold change

Carboxylesterase activity

7

9.60E-04

6.1

Response to external stimulus

14

3.50E-03

2.5

Fatty acid metabolic process

7

4.60E-03

4.5

Cellular lipid metabolic process

13

6.30E-03

2.4

Sterol transport

3

8.60E-03

21

The five most significant terms from among the top functional annotation clusters are shown

Table 4

Specific genes contributing to enrichment of the term “cellular lipid metabolic process”

Gene

Rank in GWAS

Marker

P value (original)

CETP

1

rs7203984

4.55E-20

LPL

2

rs328

2.29E-11

LIPC

3

rs1077834

5.18E-10

ABCA1

7

rs4149274

7.36E-08

MVK

17

rs7954144

4.76E-06

FADS1

21

rs174545

4.18E-05

LYPLA3

26

rs17688076

4.18E-05

PPARA

99

rs9615264

4.44E-05

SMPD2

120

rs6911838

1.14E-03

BMP6

136

rs2876117

5.15E-05

CPT1A

147

rs597539

1.53E-04

LCAT

151

rs4986970

1.07E-03

ADIPOR2

160

rs11061935

1.79E-04

Rank in GWAS is according to the LD-adjusted P value of the most significant marker associated with each gene. The presented P value is the unadjusted significance of the specified marker in the original data set

Table 5

GO terms enriched for plasma LDL levels

Term

Genes with term

P value

Fold change

Sterol transport

4

4.30E-04

25

Lipid transporter activity

6

6.40E-04

8.6

Intracellular organelle

88

1.30E-03

1.3

Nucleic acid metabolic process

47

8.10E-03

1.4

Macromolecular complex assembly

12

1.10E-02

2.4

The five most significant terms from among the top functional annotation clusters are shown

Table 6

Specific genes contributing to enrichment of the term “sterol transport”

Gene

Rank in GWAS

Marker

P value (original)

LDLR

4

rs6511720

6.76E-10

APOB

5

rs531819

1.01E-09

APOE

93

rs405509

1.70E-03

ABCA1

105

rs2000069

2.25E-05

Rank in GWAS is according to the LD-adjusted P value of the most significant marker associated with each gene. The presented P value is the unadjusted significance of the specified marker in the original data set

We next applied the above analytical scheme (first converting PLINK results to gene lists and performing enrichment tests using DAVID) to two additional genome-wide data sets, selected from the growing list of studies (http://www.genome.gov/gwastudies/) due to their size (Table 1; samples 2, 3). The most notable finding from these sets was the possible enrichment of microtubule genes in type 2 diabetes (Table 7). An interesting feature of this result was the marked difference from what was observed for lipids in that the ranks of the component genes that contribute to term significance are all much lower, but this set also had the benefit of these genes each residing in unique loci (Table 8). In contrast, the analysis of Crohn’s disease was more problematic with the apparent enrichment of several terms related to immune response appearing to arise due to clustering of genes in the well-characterized HLA region on chromosome 6p. We elected for this sample to present both the pre-trimmed and the trimmed results for this data set since it provides an illustration of the impact of clustering and where the requisite trimming changes the significance substantially (Tables 9, 10). Notably, even after trimming some possible clustering remains involving three genes, although they were not formally in strong LD (MICA, HLA-DQA2, and HLA-C).
Table 7

GO terms enriched for type 2 diabetes

Term

Genes with term

P value

Fold change

Microtubule-based process

10

1.30E-04

5.1

Response to nutrient

5

6.10E-04

12.6

Advanced glycation end-product receptor activity

3

9.90E-04

59

Intracellular organelle

82

1.40E-02

1.2

Cellular component assembly

12

1.60E-02

2.3

The five most significant terms from among the top functional annotation clusters are shown

Table 8

Specific genes contributing to enrichment of the term “microtubule-based process”

Gene

Rank in GWAS

Marker

P value (original)

KIF2A

49

rs152189

1.03E-04

KIF21A

78

rs4768736

9.96E-05

KIF3C

100

rs2384298

1.63E-04

HCN4

121

rs1564345

1.77E-04

PCLO

123

rs1157530

4.06E-05

MAP3K11

129

rs7946115

6.47E-04

CENPE

137

rs12506065

5.46E-04

KIF11

150

rs4933734

2.76E-04

DNAL4

158

rs760482

7.89E-04

KPNA2

161

rs4638

3.25E-03

Rank in GWAS is according to the LD-adjusted P value of the most significant marker associated with each gene. The presented P value is the unadjusted significance of the specified marker in the original data set

Table 9

GO terms enriched for Crohn’s disease

Term

Genes pre-trim

P value pre-trim

P value post-trim (genes)

Fold change post-trim

MHC protein complex

6

1.30E-05

NS

NA

Immune response

18

4.60E-05

0.011 (13)

2.3

Regulation of apoptosis

13

5.30E-03

0.033 (11)

2.1

Nucleocytoplasmic transport

6

1.20E-02

No change

4.3

Kinase activity

16

1.30E-02

No change

1.9

The five most significant terms from among the top functional annotation clusters are shown

NS not signficant at α = 0.05, NA not applicable

Table 10

Specific genes contributing to enrichment of the term “immune response” after trimming

Gene

Rank in GWAS

Marker

Genomic position

P value (unadjusted)

IL23R

1

rs11465804

chr1:67.4 Mb

1.01E-35

NOD2

2

rs2076756

chr16:49.2 Mb

3.42E-32

NKX2-3

7

rs10883371

chr10:101.2 Mb

1.82E-10

IRF1

17

rs2188962

chr5:131.8 Mb

4.58E-09

TNFSF15

39

rs6478109

chr9:116.6 Mb

2.61E-07

CCR6

50

rs7749278

chr6:167.4 Mb

3.29E-07

CCL11

77

rs991804

chr17:29.6 Mb

4.01E-06

MICA

88

rs2596503

chr6:31.5 Mb

3.94E-06

HLA-DQA2

99

rs7768538

chr6:32.8 Mb

7.92E-06

HLA-C

121

rs2905747

chr6:31.3 Mb

1.50E-05

IFI30

122

rs7125

chr19:18.1 Mb

3.61E-05

PTPN22

150

rs2476601

chr1:114.2 Mb

1.81E-05

CSF3

160

rs8078723

chr17:35.4 Mb

5.51E-05

Genomic position is according to NCBI Build 36.1 of the human genome

Rank in GWAS is according to the LD-adjusted P value of the most significant marker associated with each gene. The presented P value is the unadjusted significance of the specified marker in the original data set

While DAVID represents an excellent starting point for pathway analyses given its flexibility and fairly comprehensive coverage of recent annotations, there are two important alternatives. For the first, in order to complement the above analyses we chose to focus again on plasma lipid traits, and a search for enrichment was attempted using gene set enrichment analysis (GSEA; see “Methods”) which was originally designed to circumvent the arbitrary nature of threshold definitions (Subramanian et al. 2005; see “Methods”). For HDL levels in particular, this produced only modest evidence implicating lipid metabolism (Tables 11, 12). Performing GSEA based analyses on the LDL trait in this same sample failed to identify significant terms (not shown). For TG, a single term was significant at P = 0.0002 (heparin binding). However, a problem emerged when we simply removed the top ranked gene (CETP) from analysis of HDL, which almost completely eliminated the significance for lipid terms (Tables 11, 12). We think this reiterates the importance of more closely examining the specific genes that contribute to term enrichment statistics. Perhaps more importantly, this highlights a potential problem with GSEA in that strong statistical support for a pathway can be driven by a single gene. The second software that is achieving widespread use now is Ingenuity Pathway Analysis (IPA). We again subjected our lipid results for sample 1 (top 1%) for analysis using IPA specifically restricting our assessment to the category of molecular and cellular functions. For this analysis, we chose to focus on pathways with three or more contributing genes after trimming of any identified positional clusters. The best evidence of pathway enrichment was obtained for “homeostasis of cholesterol” in relation to plasma LDL levels with an uncorrected P value of 9.3 × 10−6 (note that IPA bases its analyses on a Fisher’s Exact Test which is comparable to DAVID) and included the genes ABCA1, APOB, APOE, LDLR, and PLSCR3. The next best evidence was for “metabolic process of lipid” in relation to HDL levels with an uncorrected P value of 1.3 × 10−4 and included the genes CETP, FADS1, IKBKB, IL6, LCAT, LPL, PLA2G15, PPARA, RAC1, and SMDPD2. These two lists can be directly compared to Tables 4 and 6, where several genes are represented in both analyses, but where there are differences. This highlights an important aspect of which down-stream analysis software one opts to use, as there clearly are gene annotation discrepancies. There was no evidence for enrichment of any category with more than three genes at P < 0.05 for TG.
Table 11

GO term enrichment for HDL from GSEA analyses

Term

Core genes

P value—sample 1

P value w/o CETP

Lipid transport

6

4.00E-04

0.037

Lipid transporter activity

5

8.00E-03

0.67

Reported significance is based upon 5,000 permutations. The analysis excluding CETP was done in sample 1

Table 12

Specific genes contributing to enrichment of the term “lipid transport”

Gene

Rank in GWAS

Rank Metric Score

CETP

1

18.0

ABCA1

7

5.2

PPARA

101

2.8

LCAT

153

2.5

PSAP

257

2.1

ABCD3

292

2.1

PPARD

734

1.5

Rank in GWAS is from sample 1

Finally, we sought evidence of replication for the above analyses by focusing on a second independent sample with plasma lipid traits from a recent genome-wide association study (Table 1; sample 4). We decided to focus on IPA for this analysis, given its somewhat better apparent performance compared to the other approaches above. We were unable to validate the result for LDL levels, with no functional categories with three or more genes achieving P < 0.05. However, for HDL levels, results were highly consistent with the data from sample 1, with multiple categories replicating between the two sets and with “lipid metabolism” being the most significant high-level category in both samples. The best evidence of replication was obtained for the lower-level term “homeostasis of cholesterol” with P = 3.3 × 10−4 for sample 1 and P = 5.3 × 10−4 for sample 4 (using Fisher’s combined probability test gave P = 2.3 × 10−6). This represented the best study-wide significance across all tests conducted.

Discussion

We developed and implemented a new software program to facilitate the conversion of genetic marker lists to gene lists, with the primary goal of uncovering evidence of pathway enrichment from genome-wide association studies. To accomplish this, only the largest of the many available genome-wide association study data sets were targeted, with an initial focus on plasma lipid levels that we thought might have a detectable pathway basis. We obtained a reasonably high level of statistical evidence by replicating an enrichment of lipid related terms for plasma HDL levels in two independent samples. We consider, however, the more important aspect of this work to be the broader exploration of various parameters that may affect the validity of this strategy.

A central issue in approaches such as ours is how to best obtain representation of genes from markers. Our study is distinct from previous efforts to apply pathway approaches to genome-wide data, in that we treat gene representation as a core issue. For example, the original study by Wang lists only the assignment of markers within 500 kb of genes, without further detail (Wang et al. 2007). The second study by Torkamani provides more information, linking SNPs to genes with 5 kb, but there they invoked a hierarchy giving more weight to coding SNPs and they didn’t specifically address the issue of LD (Torkamani et al. 2008). Regardless, any strategy like that presented will both increase gene diversity in a region of a single association signal, potentially causing pathway dilution, and create the possible “appearance” of enrichment due to the positional clustering of functionally related genes. For the former, we noted that across the various data sets, the inflation factor was around 0.5, or on average 50% more genes than would be expected if LD was r2 ≥ 0.8 between the markers showing the highest significance. The consequence of this will be to increase type II error. The latter problem, however, in our view, is much more serious and a few cases are exemplified where the statistical support decreases following correction for clustering. Although we attempt to resolve clustering with both manual and/or automatic curation, the only real solution will require the identification of true functional variants that can be shown to influence specific genes. Thus, functional studies on gene regulation, including identifying markers that directly affect gene expression and can be tied to a specific gene are extremely important for pathway-based analyses. There are a number of impressive in silico tools now appearing for marker and gene prioritization that may also aid this endeavor (Chen et al. 2008; Gaulton et al. 2007; Ge et al. 2008; Pico et al. 2009; Tranchevent et al. 2008). The many published genome-wide data sets also lend themselves to strategies to test for marker independence (Sun et al. 2008), but there will also be merit in exploring alternative human populations, where LD is less of a problem (Cox et al. 2002). For the time being, we acknowledge that our algorithm leads to the inclusion of an excess of genes around significant signals, but we do note that the software includes options for changing LD thresholds. Nonetheless, in terms of pathway analyses it may be argued that results of enriched terms where the contributing genes all reside in independent loci may be regarded as more reliable than those where the genes are clustered, even if not in strict LD.

We think there is reason to be cautious about biological interpretations of pathway results across the data sets here and elsewhere. The space of variables that can be tuned for these kinds of analyses is quite large, ranging from the statistics and covariates chosen in the original genome-wide studies to which pathway enrichment tool/statistic is used. This is particularly important given the relatively large number of analysis programs now available, including those used here (DAVID, GSEA-P, and IPA). Against this background, the possible enrichment of terms associated with lipid metabolism in two independent data sets for HDL levels represents a reasonable validation of this approach. However, this statistical evidence should be taken in context with that of the most significant individual gene (CETP; P ≈ 10−20). If this is an indication that the evidence one can expect from pathway analyses is likely to be weaker than for individual genes, then even larger samples than those currently employed may be required for detection. An interesting contrast to the HDL result is the essential lack of evidence in the Crohn’s disease sample. In particular, the role of the immune system for this disease is well documented (Lettre and Rioux 2008) yet is only marginally implicated by our pathway analysis. The result however reiterates a classic problem in genetics, namely the possibility that there are multiple independent functional variants acting in the HLA region that are indiscernible due to LD. This may have implications for other immune diseases, such as rheumatoid arthritis (Raychaudhuri et al. 2008). The result for diabetes yielded yet a third contrasting scenario, where there was a suggestion of the involvement of microtubule genes but no confounding by positional clustering. An intriguing aspect of this result was the inclusion of the KIF11 gene, which resides in an LD block that includes HHEX and IDE. We have focused on IDE in the past (Gu et al. 2004), and others have highlighted maximum signals nearer HHEX (Zeggini et al. 2007). Still, though there is no strong precedent for the involvement of kinesin proteins in type 2 diabetes from the literature, the additional information included by a pathway analysis may help to prioritize specific genes for further functional assessment.

At the outset of this investigation we were interested in what the evolutionary implications might be should evidence of pathway enrichment emerge. Given that both allelic and locus heterogeneity act to influence phenotypes, a natural concern was that population differences could contribute to apparent enrichment. Thus, two genes with similar annotations and acting detectably on a phenotype but exclusively in separate sub-populations would be considered together. For genes such as CETP this is unlikely to be the case since it has been demonstrated to be highly significant in numerous populations, but for genes farther down in rank it may be. This ultimately represents a trade-off in terms of obtaining high statistical power for single marker analyses (via meta-analysis) and the risk of enriching a pathway due to locus heterogeneity across different populations. Nonetheless, the assumption that diseases should arise by mechanisms involving similar genes has long been one the corner-stones of genetic association studies. Complex phenotypes have evolved to be resilient to insult (Buchanan et al. 2006), and the emergence of modest to weak effects across multiple genes in a pathway might be seen to support this.

In summary, pathway analyses of genome-wide association data based upon controlled gene annotations are now emerging, with many groups claiming evidence of enrichment for a range of phenotypes (Baranzini et al. 2009; Torkamani et al. 2008; Wang et al. 2007). The data presented in this paper also support the prospect that biological pathways can be detected in genome-wide association data, but we have still only scratched the surface of the bulk of genome wide data becoming available and we think there is still reason to be cautious. In particular, the observation that persists both here and in other studies is that there is no case where the evidence for any pathway enrichment exceeds that of the previously reported single locus findings. This can be taken as a strong contrast to gene expression studies, where striking evidence of pathway enrichment often emerged in the absence of obvious single gene effects (Mootha et al. 2003). The application of pathway approaches to genetic marker data is nonetheless relatively new territory and may represent a valuable addition to ongoing genome-wide studies given the vast range of phenotypes now being investigated (http://www.genome.gov/gwastudies/). This needs to be tempered, however, against a number of issues that have previously not been dealt with, the most important in our view being positional gene clustering. The software provided here should nonetheless help to relieve the bottleneck of automating the transition from marker lists to gene lists, and thus expedite further analyses of genome-wide data in a pathway context.

Acknowledgments

We are greatly indebted to the scientists responsible for making their genome-wide data accessible. We are grateful for financial support from The Swedish Medical Research Council (grant 2007-2722) and the National Institutes of Health (grant AG028555).

Conflict of interest statement

The authors declare no conflict of interest.

Supplementary material

439_2009_676_MOESM1_ESM.xls (42 kb)
Supplementary material 1 (XLS 42 kb)

Copyright information

© Springer-Verlag 2009