Pathway-based analysis for genome-wide association study data of bipolar disorder provides new insights for genetic study

Dear Editor, 
 
Bipolar disorder (BD) is an episodic recurrent pathological mood disturbance with estimated heritability ranging from 80%–85% (Barnett and Smoller, 2009). Although many studies have indicated that BD is a polygenic disease influenced by many genes with small effect, the pathogenesis of BD is still not well understood. According to the catalog of published genome-wide association studies (GWAS) (Welter et al., 2014), there have been 31 BD GWASs (including bipolar disorder related symptoms and combined GWAS with other related disorders) till 01/15/2015, including the largest BD GWAS from the Psychiatric Genomics Consortium (PGC) (Sklar et al., 2011). But these GWAS analyses mainly focused on single SNP/gene and only identified a number of significant SNPs that account for a small proportion of the genetic variants. To detect the association between pathway and trait for further GWAS data interpretation and mechanism study, pathway-based analysis (PBA) has been introduced to analyze GWAS data (Wang et al., 2010). 
 
Comparing with traditional genetic association analyses, PBA can detect the additive effects of multiple minor genes. It has been proved to be a feasible solution to interpret GWAS result and promote the analytical level from SNP/gene to pathway. In recent years, a number of PBA methods have been developed whose null hypothesis being tested can be broadly classified as ‘self-contained’ versus ‘competitive’ based on whether comparisons were made between genes in a specific pathway and non-associated genes or other genes in the genome. Additionally, the input data of these statistical methods can be broadly classified into two types: SNP P-values and raw genotype (Wang et al., 2010). 
 
Among the several published PBA studies for BD (as collected in BDgene (Chang et al., 2013)), the majority of the data sources focused on the Wellcome Trust Case Control Consortium (WTCCC) data (Wellcome Trust Case Control 2007), which was published several years ago. The large scale of PGC data has been used in two PBA studies (Duncan et al., 2014; Yu et al., 2014a), but they are both hypothesis-based and only focus on one type of pathway instead of all pathways. In addition, using more than one PBA tools to get the consistent result has become more popular (Duncan et al., 2014) although most of the early studies only used one. By now, the results of published PBA studies are still of low repeatability and need further replication. Hence, it is necessary to further analyze the large scale of GWAS data from PGC by using PBA methods to get more reliable disease related pathways. 
 
In this study, we analyzed six BD GWAS datasets from PGC (Sklar et al., 2011) by merging them into two groups according to the platform. The same quality control (QC) process was conducted for both groups of data by using PLINK (see supplementary material for detail). After QC, the two groups of datasets contained 4568 cases and 6255 controls in total, and the numbers of overlapped SNPs were 276,592 in Group A and 477,624 in Group B separately (as shown in Table S1). We used three tools (i-GSEA4GWAS v2 (Zhang et al., 2014), SNP Ration Test (SRT) (O’Dushlaine et al., 2009) and GenGen (Wang et al., 2007)) for our analysis. The brief introduction and parameter selection about these three tools were described in supplementary material. 
 
The pathway-based analysis results for two groups of GWAS datasets by three PBA tools were summarized in Table 1. By inputting SNP P-value list, i-GSEA4GWAS v2 generated 31 candidate pathways for Group A and 31 candidate pathways for Group B, which shared 12 pathways between Group A and Group B. The calculation of gene-set enrichment score was similar between i-GSEA4GWAS v2 and GenGen, but GenGen used genotype data as input and permutation was conducted at phenotype level. GenGen got 133 candidate pathways from Group B, but none from Group A. Furthermore, SRT provided 10 candidate pathways from Group A and 5 candidate pathways from Group B, and there was no overlapped pathway between Group A and Group B. Comparison among different PBA tools for the same GWAS dataset was conducted. For Group A, i-GSEA4GWAS v2 shared four pathways with SRT; no shared pathway for either other pairs of tools and three tools. For Group B, i-GSEA4GWAS v2 shared 27 pathways and two pathways with GenGen and SRT separately; GenGen and SRT shared three pathways, in which, two pathways were shared by three tools. 
 
 
 
Table 1 
 
Comparison result for different groups of GWAS datasets and PBA tools 
 
 
 
As a result, 33 unique pathways were obtained from the shared pathways by GWAS datasets or PBA tools. The list of pathways and supported datasets and tools were shown in Table 2. Among these 33 pathways, one pathway “TGF beta signaling pathway” was validated by one PBA tool, two pathways (“Oocyte meiosis” and “Ubiquitin mediated proteolysis”) were validated by all three PBA tools, other pathways were validated by two PBA tools. Moreover, six pathways had FDR < 0.05 in Group A dataset and six pathways had FDR < 0.05 in Group B dataset. Especially, ‘Retinol metabolism’ and ‘Metabolism of xenobiotics by cytochrome P450’ had FDR < 0.05 in both Group A and Group B datasets, which showed great possibility of correlation with BD. 
 
 
 
Table 2 
 
Validated pathways by two groups of GWAS datasets or at least two PBA tools 
 
 
 
Our analysis also validated several pathways reported by previous PBA studies. For example, phosphatidylinositol signaling system and p53 signaling pathway were identified as risk pathways by Chen et al. using a risk-scoring measurement to fuse SNPs and pathways (Chen et al., 2009); N-Glycan biosynthesis, ABC transporters and cell cycle were identified in (Peng et al., 2010) by using hypergeometric test; Neurotrophin signaling pathway has not been reported by PBA of BD yet, but it has been reported by network analysis paper for schizophrenia (Yu et al., 2014b), and neurotrophic signaling cascades has played an important role in the pathophysiology and treatment of bipolar disorder (Shaltiel et al., 2007). 
 
Besides the pathways validated by other PBA papers, we also identified several novel pathways, which have not been reported to be associated with bipolar disorder by PBA study but have been validated in other literature and deserve further attention, including ‘Ubiquitin mediated proteolysis’ and ‘Oocyte meiosis’, which were validated by three PBA tools in our analysis and also statistical significant in Group A dataset (FDR < 0.05); ‘Retinol metabolism’ and ‘Metabolism of xenobiotics by cytochrome P450’, which were identified in both two groups with significant association with BD (P-value < 0.05, FDR < 0.05). Evidence for the association of these pathways with bipolar disorder and other related psychiatric disorders from literature was summarized in the supplementary material. 
 
In summary, compared with previous PBA studies on BD, our analysis includes more samples and uses genotype data to analyze all KEGG pathways. By using multiple GWAS datasets and PBA tools, our analysis could provide better interpretation for the GWASs of BD to further identify disease-related mechanism. The pathways we identified would provide new insights for the genetic and mechanism study of BD. In particular, the novel pathways we first identified such as ‘Ubiquitin mediated proteolysis’, ‘Oocyte meiosis’ and ‘Metabolism of xenobiotics by cytochrome P450’ would provide new perspectives to the following studies. These pathways need and also deserve further validation and replication to explore the pathogenic mechanism of BD.


GWAS datasets
The six BD GWAS datasets for this analysis (as shown in Stable 1) include Systematic Treatment Enhancement Program for Bipolar Disorder (STEP1), University College London (UCL), Trinity College Dublin (Irish) and Thematically Organized Psychosis (TOP) Study (Norwegian) from NIMH Repository and Genomics Resource (Distribution 8.0), GWAS for BD from Wellcome Trust Case-Control Consortium (WTCCC), and Genetic Association Information Network European-American ancestry (GAIN_EA,Accession No. phs000017.v3.p1) from dbGaP. The same quality control (QC) process was conducted for both groups of data by using PLINK (Purcell et al., 2007): (i) exclude SNPs with missing rate >0.05 or minor allele frequency <0.05 (before sample removal below), (ii) exclude individuals with missing rate >0.05 or heterozygosity rate outside ± 3 SD (standard derivation), (iii) calculate identity by descent (IBD) after pruning SNPs using r 2 =0.2 as threshold, and then exclude duplicated or related individuals with IBD>0.185, and (iv) exclude SNPs with significant differences in case and control call rates (P<10 -5 ) or minor allele frequency >0.05 or Hardy-Weinberg equilibrium P <10 -6 . Principal component estimation was done with the same collection of SNPs for IBD calculation using EIGENSTRAT (Price et al., 2006). We estimated the first 20 principle components and evaluated their impact on the genome-wide test statistics using λ as (Ripke et al., 2011). Based on this, we used the top five principle components with the smallest λ together with the study indicator variables as associated covariates for logistic regression test to get P-values of SNPs.

i-GSEA4GWAS v2
i-GSEA4GWAS v2 (http://gsea4gwasv2.psych.ac.cn/) is a free web server as an open platform for researchers to analyze GWAS data (Zhang et al., 2014). Unlike the conventional gene set enrichment analysis (GSEA) (Subramanian et al., 2005) with a heavy dependence on genotype data, this approach uses SNP label permutation instead of phenotype label permutation to calculate P-values, which makes it easily available for most published GWAS investigations. Besides, based on the enrichment score (ES), i-GSEA4GWAS v2 uses the significance proportion based enrichment score (SPES), which emphasizes the total significance coming from high proportion of significant genes to improve sensitivity. The SNP P-value list from association analysis was used as the input of i-GSEA4GWAS v2. We mapped the SNPs in the 5 kb upstream and downstream of gene to genes. The pathway dataset, which was served as the search space of our PBA, was from Kyoto Encyclopaedia of Genes and Genomes (KEGG) (Ogata et al., 1999).
Only the pathways with size (number of genes involved) more than 20 and less than 200 were used in our PBA for BD. Other input parameters were default. Those pathways with false-discovery rate (FDR) < 0.25 were kept as candidate pathways associated with BD.

SNP Ratio Test
The SNP ratio test (SRT) (O'Dushlaine et al., 2009) compares the proportion of significant SNPs to all SNPs within genes that are part of a pathway and computes an empirical P-value based on comparisons to ratios in datasets where the assignment of case/control status has been randomized. The input is the genotype data of GWAS in PLINK binary format. The same KEGG pathways as we used in i-GSEA4GWAS v2 were used for pathway searching. Mapping rule from SNPs to genes was also 5 kb upstream and downstream. Standard association analyses were performed for both the original data and 1000 randomized phenotype datasets. Finally, a list containing the empirical P-value for the statistically significant enrichment of GWAS associated SNPs within each KEGG pathway was generated. Those pathways with P-value < 0.05 were kept as candidate pathways associated with BD.

GenGen
GenGen is a modified GSEA algorithm to perform pathway-based analysis of GWAS data. It takes the maximum statistic for all SNPs near a gene to represent the significance of the gene and uses a permutation-based approach that shuffles the phenotype labels of cases and controls to adjust for multiple testing (Wang et al., 2007).
The input file for GenGen is genotype data. The same KEGG pathways and SNP -> gene mapping rule were used as the other two tools. According to our data size, 1000 permutation cycles were used to adjust for multiple-hypothesis testing. Those pathways with FDR< 0.25 were selected as final results.

Literature evidence for new identified pathways
Besides the pathways validated by other PBA papers, we also identified several novel pathways, which have not been reported to be associated with bipolar disorder by PBA study but have been validated in other literature and deserve further attention, including 'Ubiquitin mediated proteolysis' and 'Oocyte meiosis', which were validated by three PBA tools in our analysis and also statistical significant in Group A dataset (FDR < 0.05); 'Retinol metabolism' and 'Metabolism of xenobiotics by cytochrome P450', which were identified in both two groups with significant association with BD (P-value < 0.05, FDR < 0.05).
Protein ubiquitination plays an important role in eukaryotic cellular processes; and it mainly functions as a signal for 26S proteasome dependent protein degradation. Now, proteolysis by the ubiquitin-proteasome pathway (UPP) is widely recognized as a molecular mechanism controlling myriad normal functions in the nervous system (Hegde and Upadhya, 2011). This pathway has been identified as a canonical pathway associated with several neuropsychiatric and brain disorders, including Alzheimer's (Lam et al., 2000), Parkinson's (Shimura et al., 2001), bipolar disorder (Bousman et al., 2010), ethanol-induced disorders (Donohue, 2002;Hegde and Upadhya, 2011) and schizophrenia (Middleton et al., 2002;Altar et al., 2005). Recently, Maria et al. have reported the abnormalities of ubiquitination system in schizophrenia by using gene expression analysis (Rubio et al., 2013). The evidence supported the association of 'Ubiquitin mediated proteolysis' with bipolar disorder. Meiosis is a specialized type of cell division, which reduces the chromosome number by half. Although there is no direct evidence showing that oocyte meiosis is related to BD, a study has pointed out that ubiquitin-proteasome pathway plays important roles in oocyte meiosis resumption, spindle assembly, polar body emission, and pronuclear formation, probably by regulating cyclin B1 degradation and MAPK/p90rsk phosphorylation (Huo et al., 2004). Among these, 'MAPK signaling pathway' is reported by two PBA studies for BD (Chen et al., 2009;Peng et al., 2010). Moreover, there is a report that the aggregation of abnormally phosphorylated Tau proteins, which is considered as typical pathology of Alzheimer, is related to the reactivation of mitotic mechanisms (Delobel et al., 2002). Another pathway we identified, named 'Progesterone-mediated oocyte maturation', is also involved in this process. It provides evidence that the pathway 'Oocyte meiosis' may be associated with BD. Hence, it is worthy to further study the relationship between oocyte meiosis and BD.
Another two remarkable pathways were 'Retinol metabolism' and 'Metabolism of xenobiotics by cytochrome P450'. The pathway 'Retinol metabolism' has been reported in a PBA study for BD (Smith et al., 2009), providing a validation for our result.
'Metabolism of xenobiotics by cytochrome P450' has not been reported in any BD PBA studies so far, but many studies that focused on the drug metabolism and treatment of BD have mentioned cytochrome P450. Quetiapine, which was found to be effective in the treatment of BD, is predominantly metabolized by cytochrome P450 3A4 (Khazaal et al., 2013).