Pathway Analyses and Understanding Disease Associations
High throughput technologies have been applied to investigate the underlying mechanisms of complex diseases, identify disease associations, and help to improve treatment. However, it is challenging to derive biological insight from conventional single gene-based analysis of “omics” data from high-throughput experiments due to sample and patient heterogeneity. To address these challenges, many novel pathway- and network-based approaches have been developed to integrate various “omics” data, such as gene expression, copy number alteration, genome-wide association studies, and interaction data. This review will cover recent methodological developments in pathway analysis for the detection of dysregulated interactions and disease-associated subnetworks, prioritization of candidate disease genes, and disease classifications. For each application, we will also discuss the associated challenges and potential future directions.
Keywords: Pathway analysis; Dysregulated interaction; Disease association; Genome-wide association studies (GWAS); Gene prioritization; Disease classification
Biomedical research has been revolutionized by advanced high-throughput (HT) technologies for the study of genomic, transcriptomic, proteomic, and metabolomic “molecular phenotypes”, including microarrays, next-generation sequencing, RNAi library screening, and high-throughput, high-resolution mass spectrometry [1, 2, 3]. However, due to the complexity of diseases, background noise in HT experiments, the need for multiple hypothesis testing corrections, and patient heterogeneity, it has been challenging to interpret the direct results of experiments to elucidate biological mechanisms relevant to complex diseases [4, 5••, 6]. Recently, methods that target pathway-level analysis have been developed and applied to investigate the underlying mechanisms of complex diseases. There are several rationales behind these methods. Genes/proteins do not work alone, but in an intricate network of interactions and pathways. In addition, complex diseases are more likely caused by the dysregulation of multiple targets in connected pathways and/or of different genes in the same pathways in different patients. Pathway analysis also has statistical advantages in that it can reduce the dimensionality of HT datasets and provide a focused set of targets for biological validation. However, error rate estimation is more likely to be empirical than grounded in theory. Identifying disease-associated pathways can help to understand disease mechanisms and has the potential to improve diagnostics and develop efficient treatments.
Detecting Interaction Dysregulation
The majority of pathway analyses can be grouped into three classes: over-representation analysis, functional class scoring, and pathway topology-based methods [22•]. Over-representation analysis starts from a list of genes and a set of pathways; every pathway is tested for over- or under-representation in the list of input genes using a statistical test based on the hypergeometric or chi-square distribution. This approach treats each gene equally and ignores data associated with each gene, such as mRNA expression levels or p values from GWAS. Many popular methods, such as FatiGO and GoMiner, belong to this class. Alternatively, the functional class scoring approach, such as the well-known gene set enrichment analysis (GSEA), gene set analysis (GSA), and similar methods [8, 9], takes genes and their associated expression values as inputs. A gene-level statistic is computed, typically using a t test; then, for each pathway, a single pathway-level statistic is computed by aggregating gene-level statistics; finally, the significance of the pathway-level statistic is evaluated empirically by permutation. The basic steps of the pathway topology-based approach are quite similar to those of functional class scoring, except that it takes into account the pathway topology when computing the gene-level statistics [23, 24]. However, almost all methods described above are designed to identify disease-associated pathways by investigating changes in genes, which are only one of the components of pathways. More recently, approaches have been proposed to investigate other components of pathways, such as interactions.
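The over-representation test described above can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test for over-representation only; the function names and the toy gene identifiers are hypothetical and are not taken from FatiGO, GoMiner, or any other cited tool.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes in total, K in the pathway, n drawn)."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def over_representation(gene_list, pathways, background):
    """Return {pathway name: (overlap size, one-sided p value)}.

    Every gene counts equally; expression levels or GWAS p values are ignored,
    which is exactly the limitation noted in the text.
    """
    universe = set(background)
    genes = set(gene_list) & universe
    results = {}
    for name, members in pathways.items():
        members = set(members) & universe
        k = len(genes & members)
        p = hypergeom_upper_tail(k, len(universe), len(members), len(genes))
        results[name] = (k, p)
    return results
```

In practice the resulting p values would still need correction for the number of pathways tested.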
The physical entities in a pathway, such as genes, are only one of the fundamental components of pathways (in the network model, genes are represented as nodes). Other important components are the interactions among them, i.e., gene and protein interactions, and the dynamics of those interactions (in the network model, interactions are represented as edges). Both genes and the interactions among them are essential and tightly regulated for the proper functioning of the system; perturbation of either can lead to dysregulation, i.e., disease [25, 26]. Studies have shown that cellular networks exhibit systems properties underlying phenotypic variations [5••, 27, 28]. Zhong et al. analyzed 50,000 known disease-causative mutations and proposed two distinct types of mutations: one type leads to node removal from the network due to the destruction of the reading frame or destabilization of protein structure; the other type, such as a single amino-acid substitution at a binding site, may affect the ability of a protein to bind/interact with its partners. The latter type was considered an edge-specific (edgetic) perturbation, which confers distinct functional consequences compared to node removal. Identifying and distinguishing both types of mutations will improve our understanding of diseases and help to develop efficient treatments. In this section, we focus on methods that are designed to detect dysregulated pathways in terms of interactions.
Liu et al. [30••] proposed gene interaction enrichment and network analysis (GIENA) to identify dysregulated gene interactions and pathways using functions that model the relationships of cooperation, competition, redundancy, and dependency among the expression levels of genes. These functions are defined as follows: the sum of mRNA expression levels, which models cooperation; the difference between mRNA expression levels, which models competition; and the maximum/minimum mRNA expression level, which models redundancy/dependency between a pair of genes. Moreover, the regulatory logic governing the perturbation in diseases can be constructed based on the detected dysregulated interactions. The proposed framework was applied to identify dysregulated pathways in cancer. The results showed that GIENA can identify pathways that are well known and biologically meaningful, that the results are highly reproducible, and that GIENA is efficient in extracting weak signals and identifying pathways that are missed by gene-centered methods, such as GSEA/GSA [8, 9]. In other studies, the relative expression of two genes has also been applied to classify two closely related cancers, and to identify tightly regulated networks and their changes in diseases [31, 32]. In another study, Taylor et al. defined the difference in the expression of a hub gene and each of its partners as interaction coherence, and the change of interaction coherence was measured between disease and control samples.
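To make the four pairwise functions concrete, the following sketch computes each interaction profile for one gene pair and compares disease with control samples. The use of a Welch t statistic as the comparison is our assumption for illustration; it is not claimed to be the statistic used by GIENA, and all names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# The four pairwise models named in the text; each maps two expression
# values (one sample, two genes) to a single interaction value.
INTERACTION_MODELS = {
    "cooperation": lambda a, b: a + b,  # sum of expression levels
    "competition": lambda a, b: a - b,  # difference of expression levels
    "redundancy": max,                  # maximum of the pair
    "dependency": min,                  # minimum of the pair
}


def welch_t(x, y):
    """Welch t statistic (assumed comparison; requires non-zero variance)."""
    vx, vy = stdev(x) ** 2, stdev(y) ** 2
    return (mean(x) - mean(y)) / sqrt(vx / len(x) + vy / len(y))


def interaction_scores(g1_disease, g2_disease, g1_control, g2_control):
    """Return {model name: t statistic of disease vs. control interaction values}."""
    scores = {}
    for name, model in INTERACTION_MODELS.items():
        disease = [model(a, b) for a, b in zip(g1_disease, g2_disease)]
        control = [model(a, b) for a, b in zip(g1_control, g2_control)]
        scores[name] = welch_t(disease, control)
    return scores
```

A pair whose sum barely changes but whose difference shifts strongly would score high under the competition model only, which is the kind of signal a gene-centered test can miss.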
Mani et al. developed a method to identify gene pairs showing either a gain of correlation (GoC) or a loss of correlation (LoC) pattern of gene expression in the diseases, compared with the pattern in healthy individuals. A gene set is constructed and its interactions are catalogued, and these interactions are either gained (GoC) or lost (LoC), i.e., dysregulated, in the diseases under investigation. The dysregulated interactions are pooled together to identify genes with a significantly high number of dysregulated interactions in their neighborhood. Combining the B-cell interactome with gene expression profiles from three malignant B-cell phenotypes, the authors demonstrated that their method can identify genes and pathways enriched for such gained or lost correlations, which are likely implicated in tumorigenesis, and their method can detect some well-known oncogenes, such as BCL2 and SMAD1, which traditional methods can fail to detect. They also found that the patterns of dysregulated interactions are dramatically different among three malignant B-cell phenotypes, indicating different underlying mechanisms among them. In another study, Zhang et al. proposed a similar method to detect dysregulated interactions and pathways in diseases. In their study, the difference of co-variances or correlations between two genes from healthy and disease groups represented the interaction between them. Coupled with GSA, their method was able to detect pathways with dysregulated interaction enrichments.
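The gain/loss-of-correlation idea can be sketched as follows. Pearson correlation and a fixed threshold on the change in absolute correlation are assumptions made for illustration; the cited studies may use other dependency measures and formal significance tests.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_change(x_disease, y_disease, x_control, y_control, threshold=0.5):
    """Label a gene pair as GoC, LoC, or unchanged.

    `threshold` is a hypothetical cutoff on |r_disease| - |r_control|.
    """
    r_disease = pearson(x_disease, y_disease)
    r_control = pearson(x_control, y_control)
    delta = abs(r_disease) - abs(r_control)
    if delta >= threshold:
        label = "GoC"
    elif delta <= -threshold:
        label = "LoC"
    else:
        label = "unchanged"
    return r_disease, r_control, label
```

Counting, for each gene, how many of its interactions are labeled GoC or LoC then gives the neighborhood-level summary described above.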
Watkinson et al. utilized a synergy concept from information theory to define types of gene interactions. The synergy of two genes is defined as a function of the mutual information (MI) between gene expression profiles (gene1 and gene2) and phenotype status (phenotype): Synergy(gene1, gene2) = MI(gene1, gene2; phenotype) − [MI(gene1; phenotype) + MI(gene2; phenotype)]. Positive synergy indicates a gene interaction, and a synergy network can be constructed based on the detected interactions. Using gene expression data from prostate cancer patients and healthy individuals, the authors found strong synergies between many gene pairs, which can predict prostate cancer much better than the simple additive contribution of individual genes. RBP1 appears most frequently in high-synergy gene pairs. RBP1 inhibits the PI3K/Akt survival pathway, indicating that PI3K/Akt is associated with prostate tumorigenesis. In another study, MI was also used to measure the activity of a network, and dysregulated subnetworks were identified in diseases or at different developmental stages using a heuristic search algorithm.
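The synergy formula above can be checked on toy data. The sketch below assumes that expression values have already been discretized (for example, into low/high) and uses plug-in frequency estimates of MI in bits; both choices are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def synergy(gene1, gene2, phenotype):
    """Synergy = MI(gene1, gene2; phenotype) - [MI(gene1; phenotype) + MI(gene2; phenotype)]."""
    joint = list(zip(gene1, gene2))  # treat the pair as one joint variable
    return mutual_information(joint, phenotype) - (
        mutual_information(gene1, phenotype) + mutual_information(gene2, phenotype)
    )
```

An XOR-like pattern, where neither gene alone is informative but the pair determines the phenotype, gives a synergy of one bit, whereas two genes that each fully determine the phenotype on their own give a negative value.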
Although the methods described above can detect dysregulated interactions in diseases, this field is still in its early stage of development. Several important questions need to be addressed before they are widely applied, e.g., which method performs better, how to validate the detected interactions, and what is the nature of the interactions. Furthermore, the gene-based and interaction-based methods are complementary; thus, it is desirable to integrate both approaches to provide a comprehensive understanding of complex diseases.
Pathway-Based Methods to Detect Disease-Association
Pathway-based analysis was first developed for the analysis of gene expression profiling from microarray experiments to identify pathways that have modest but consistent expression changes in diseases [22•]. In the last 5 years, over 1,000 GWAS have been conducted searching for genetic associations of common diseases, and pathway analyses have been extended to GWAS data to understand the underlying disease mechanisms [38, 39•]. More recently, integrative approaches have been developed to combine GWAS data with multiple “omics” data, such as mRNA expression, copy number alteration, and interaction network data (PPI and gene regulatory networks). Our pathway knowledge is far from complete, and strong evidence suggests that disease-associated proteins tend to interact with each other; thus, the integration of interaction networks with GWAS data is expected to improve association detection methods [28, 40, 41, 42, 43]. In this section, we will focus on the latest methodological developments in pathway (network)-based detection of disease association, especially methods integrating GWAS with other “omics” data.
Many studies have demonstrated that integrating GWAS data with other “omics” data can provide additional information and biological insight beyond conventional GWAS analysis, e.g., the underlying disease pathways that conventional methods failed to identify. Jia et al. [44•] integrated both GWAS and PPI network data to identify disease-associated subnetworks. The method first mapped all SNPs and their p values in a GWAS dataset to genes based on the SNP-gene association (the most significant p value among the SNPs of each gene was considered to represent the p value of the gene); then, genes and their p values were loaded onto a human PPI network; finally, dense module searching, previously developed for gene expression datasets, was used to search for subnetworks that locally maximize the proportion of low p value genes in the GWAS dataset. The method was applied to two GWAS datasets, for breast cancer and pancreatic cancer, and identified gene sets and the connections among these genes (subnetworks) in the context of PPI networks; further analyses showed that several cancer-related pathways were enriched in both gene sets [44•].
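The first step of this workflow, assigning each gene the most significant p value among its SNPs, is simple enough to sketch directly. The second function is only a crude stand-in for a module score (the fraction of genes below a hypothetical cutoff); it is not the dense module searching algorithm itself.

```python
def gene_p_values(snp_p, snp_to_genes):
    """Assign each gene the smallest p value among the SNPs mapped to it."""
    gene_p = {}
    for snp, p in snp_p.items():
        for gene in snp_to_genes.get(snp, ()):
            if gene not in gene_p or p < gene_p[gene]:
                gene_p[gene] = p
    return gene_p


def low_p_fraction(module, gene_p, cutoff=0.05):
    """Fraction of scored genes in a candidate module with p below `cutoff`.

    `cutoff` is a hypothetical illustration parameter; genes without a
    p value are ignored.
    """
    scored = [g for g in module if g in gene_p]
    if not scored:
        return 0.0
    return sum(gene_p[g] < cutoff for g in scored) / len(scored)
```

Note that taking the minimum p value favors genes with many SNPs, so gene length and SNP density are worth keeping in mind when interpreting such scores.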
To detect disease-associated subnetworks from GWAS data and reduce the burden of multiple hypothesis testing, Pan introduced a network-based approach that gives higher weight to subnetworks that contain known disease genes or their partners. Two weighting schemes are proposed, based on exponential and inverse probabilities. Compared with exhaustive search, this approach significantly decreases the search space. Using a human PPI network and 23 known ataxia-causing genes, the author demonstrated that ataxia-causing genes are clustered in the network, and that subnetworks containing both disease genes and novel genes can be detected. Taking advantage of previous knowledge about disease-associated genes, PPI networks and pathways, and eSNPs, Liu et al. [46••] proposed four frameworks to discover disease-associated interactions from GWAS data. Four types of SNP sets were constructed first, based on prior knowledge (e.g., all SNPs associated with genes in a single pathway, or SNPs in genes in a disease-associated PPI network), and then all SNP–SNP interactions within each set were tested exhaustively for disease association using a logistic regression model. These approaches significantly decreased the search space and reduced the number of hypotheses tested, and were applied to detect interactions in a GWAS dataset for type 2 diabetes (T2D). Interestingly, SNP interactions detected from the four frameworks partially overlapped, and a connected network could be constructed [46••]. More importantly, disease associations of some SNP pairs were not tested because they are never present in the same pathway or network; additional testing revealed two interactions that were significantly associated with T2D, which gives additional support for the association between the network and T2D [46••].
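The reduction in search space obtained by testing SNP pairs only within knowledge-based sets can be illustrated by counting tests. This sketch covers only the bookkeeping; the logistic regression test itself is omitted, and the set names are hypothetical.

```python
from itertools import combinations


def within_set_pairs(snp_sets):
    """Unique unordered SNP pairs that co-occur in at least one prior-knowledge set."""
    pairs = set()
    for snps in snp_sets.values():
        pairs.update(frozenset(pair) for pair in combinations(sorted(set(snps)), 2))
    return pairs


def exhaustive_pair_count(n_snps):
    """Number of pairwise tests if every SNP is tested against every other SNP."""
    return n_snps * (n_snps - 1) // 2


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests
```

Pairs that never share a set are never enumerated, which is why some potentially relevant SNP pairs go untested unless they are examined separately, as noted above.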
Methods have been developed to combine expression data with GWAS data to identify disease-associated pathways [47, 48]. Xiong et al. developed gene set association analysis, which simultaneously takes into account SNP and gene expression variation to identify disease-associated pathways that are enriched for differential expression and/or trait-associated SNPs. In another study, pathways enriched for SNPs that are associated with the expression of genes (eSNPs) were targeted. Zhong et al. identified eSNPs that are associated with the expression of genes in liver, subcutaneous adipose, and omental adipose tissue [48, 49]. Each eSNP was tested for association with disease, generating a p value; the p value was assigned to the gene whose expression is associated with the eSNP. A previous method based on GSEA was then used to detect pathways enriched for eSNPs. This approach was applied to identify pathways associated with T2D. Many of the pathways identified have been proposed as important candidate pathways for T2D, and novel associated pathways were also found, including the tight junction, complement, and coagulation pathways, and antigen processing and presentation pathways.
Based on the observation that some genomic events (somatic mutations or copy number alterations) within oncogenic pathways exhibit a statistically significant level of mutual exclusivity, it has been proposed that mutation or alteration of two or more genes within the same oncogenic pathway does not offer a selective advantage for tumor cells [51••]. Ciriello et al. [51••] designed a novel method, mutual exclusivity modules in cancer, to identify network modules in which oncogenic mutations are mutually exclusive, by integrating somatic mutations, copy number alteration, mRNA expression, and PPI network data and using correlation analysis. The application of this method to glioblastoma identified multiple gene pairs in the PI3K, p53, and Rb pathways that show significant mutual exclusivity of mutations or genomic alterations [51••]. The authors suggested that the mutual exclusivity of mutations in two genes is due to the fact that the alteration of a second gene within the same pathway offers no further selective advantage [51••]. Similar network-based integrative methods have been proposed to identify pathways that drive cancer subtypes and cooperative genetic alterations in brain tumors, and to infer patient-specific pathway activities and driver genes [52, 53, 54].
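A simple way to quantify mutual exclusivity for a single gene pair is to ask how surprising the observed number of co-altered samples is, given how often each gene is altered. The sketch below uses a one-sided hypergeometric (Fisher-type) lower tail for this purpose; this is a swapped-in illustration and not the correlation analysis or module-level test used by the cited method.

```python
from math import comb


def co_occurrence_lower_tail(altered_a, altered_b, n_samples):
    """Return (number of co-altered samples, P(co-occurrence <= observed)).

    `altered_a` and `altered_b` are collections of sample identifiers in
    which gene A or gene B is altered; a small probability suggests the
    two genes are altered together less often than expected by chance.
    """
    a, b = set(altered_a), set(altered_b)
    both = len(a & b)
    total = comb(n_samples, len(b))
    p = sum(
        comb(len(a), i) * comb(n_samples - len(a), len(b) - i)
        for i in range(both + 1)
    ) / total
    return both, p
```

With many gene pairs tested, these probabilities would again need multiple testing correction before any pair is called mutually exclusive.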
Kim et al.  developed another approach to identify disease-causal genes and associated dysregulated pathways by integrating gene expression, copy number alterations, and interaction networks (including interaction data such as PPI, phosphorylation events, and protein–transcription factor interactions). An expression quantitative trait loci analysis was applied to determine the causal loci of each differentially expressed gene (target genes) by using a linear regression model on the differentially expressed genes and copy number alterations of 911 selected loci. To filter the false positive associations and determine the pathways associated with causal and target genes, a circuit flow algorithm was adopted to search the path from one causal gene to the target genes in the PPI, protein–DNA networks, and phosphorylation events. The results were further filtered by accounting for multiple hypothesis testing corrections or selecting the set of genes that best explained most disease cases.
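The regression step of such an expression quantitative trait loci analysis can be sketched with an ordinary least-squares fit of one gene's expression on the copy number of each locus. Selecting the locus by the largest R² is an assumption made for brevity; the cited work additionally filters associations with a circuit flow algorithm, which is not shown here.

```python
from statistics import mean


def fit_line(copy_number, expression):
    """Least-squares fit expression ~ intercept + slope * copy_number.

    Returns (slope, intercept, r_squared); inputs must be non-constant.
    """
    mx, my = mean(copy_number), mean(expression)
    sxx = sum((x - mx) ** 2 for x in copy_number)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(copy_number, expression))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def best_locus(expression, loci):
    """Name of the locus whose copy number best explains the expression profile."""
    return max(loci, key=lambda name: fit_line(loci[name], expression)[2])
```

A strong fit is only a candidate association; as the text notes, further filtering is needed to separate causal loci from false positives.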
The challenges in the detection of disease-associated pathways include the lack of a comprehensive and accurate human interactome, poor understanding of the biological functions and role of intergenic regions of the human genome, and the lack of comprehensive epigenetic datasets. PPI networks have been commonly integrated with mRNA expression, GWAS, and other “omics” data to identify disease-associated subnetworks. Although this approach can provide many novel insights into the underlying disease mechanisms, we should keep in mind problems such as the poor correlation between mRNA and protein expression, the likely tissue-specific and dynamic nature of PPI networks, and the existence of other important interactions, such as transcription factor binding to DNA, microRNA interactions with mRNA, and other potential genetic interactions. As many SNPs identified by GWAS are located in intergenic regions and their functional connections are unknown, it is currently challenging to include them appropriately in pathway analysis. Those SNPs might have strong effects on the expression of distant genes by altering regulation or amplification status, i.e., by acting as enhancers. Recent studies have provided evidence that SNPs in “gene deserts” can physically interact with the promoter via transcription factor binding and act in an allele-specific manner to regulate oncogene expression. Epigenetic events, such as DNA methylation and histone modification, are another layer of regulation of gene expression, and post-translational modifications of proteins are an obvious new area of interest and importance. Many studies have shown that all these types of alterations are associated with cancer and other diseases [62, 63], but it is challenging to integrate them with other data due to the lack of data and poor understanding of the functional mechanisms of regulation.
Prioritizing Candidate Disease Genes Using Network Knowledge
Gene prioritization aims to rank a list of candidate genes based on their likelihood of being disease-associated, for further validation, through integrative analyses of available data, such as literature, function annotation, sequence similarity, linkage and association data, and gene expression profiling [64, 65, 66, 67]. Recently, network knowledge, such as disease networks and PPI or functional linkage networks, has been integrated to prioritize candidates. Most of the early methods made the assumption that genes closer to each other in the network are likely associated with similar diseases (the guilt by association assumption). For example, Wu et al. constructed an integrated network by combining disease networks and PPI networks using disease–gene associations. A score is calculated to measure the concordance between the phenotype similarities and the functional genetic relatedness of genes. The candidate genes are ranked based on their score. It has been shown that in 709 out of 1,444 cases, this method successfully ranks disease genes at the top. Linghu et al. and others constructed functional linkage networks by integrating multiple “omics” data (PPI, coexpression, functional annotation, co-occurrence in literature, etc.), and applied them to prioritize candidate genes [70, 71, 72]. Goncalves et al. compared the performance of gene prioritization methods using a PPI network alone and networks integrating heterogeneous resources, and found that the integrative networks perform better than a single PPI network in most cases.
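In its simplest form, the guilt by association assumption amounts to scoring each candidate by its direct links to known disease genes. The sketch below is this bare-bones version, with a hypothetical adjacency-dictionary representation of the network; it is not the concordance score of any cited method.

```python
def rank_by_direct_neighbors(network, disease_genes, candidates):
    """Rank candidates by the number of known disease genes among their neighbors.

    `network` maps each gene to the set of its interaction partners.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    disease = set(disease_genes)

    def score(gene):
        return len(set(network.get(gene, ())) & disease)

    return sorted(candidates, key=lambda gene: (-score(gene), gene))
```

Because the score is a raw count, highly connected genes tend to rank well regardless of their relevance, which is one source of the node degree artifacts that have been raised against this class of methods.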
Methods based on guilt by association have been questioned because of concerns about statistical artifacts that result from node degree effects or exceptional edges [74•]. Kohler et al. developed a method that takes into account the indirect interactions between candidate and disease genes. This method gave more weight to candidate genes that share more interacting partners with disease genes. More recently, methods using global network properties have been developed. Proteins with different functions are connected in interaction networks to carry out signaling or metabolic functions, so that PPI networks are organized into recurrent schemas. Based on these observations, Erten et al. [77••] proposed that disease genes likely exhibit topological profile similarity, so that the topological profiles of candidate genes can be measured, compared with those of disease genes, and used to prioritize potential candidates. The topological profile of a protein is represented by effective conductance, a concept from electrical circuits, which can be efficiently computed using random walks. If the protein products of candidate genes are topologically similar to the products of disease genes (i.e., the effective conductances of candidate and disease genes are significantly correlated), then the candidate genes are likely associated with the disease. Thus, the correlation of effective conductance is used to prioritize the candidate genes [77••]. Similar methods considering network properties have also been proposed [73, 78]. Results show that these methods significantly outperformed those based on the guilt-by-association assumption [43, 73, 75, 77••, 78]. Machine learning approaches coupled with statistical procedures have also been applied to filter background SNPs, construct networks, and rank SNPs. McKinney and colleagues developed evaporative cooling (EC) to filter SNPs and detect disease-associated networks from GWAS data [79, 80, 81]. This approach has been applied to GWAS data for bipolar disorder, and identified top-ranked SNPs in ANK3 and DGKH, which have been previously associated with bipolar disorder.
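One common way to let indirect interactions and global network structure contribute to a candidate's score is a random walk with restart from the known disease genes. We use it here purely as an illustrative stand-in for the network-propagation methods discussed above; it is not the effective conductance computation, and the restart probability and iteration count are hypothetical parameters.

```python
def random_walk_with_restart(network, seeds, restart=0.3, n_iter=100):
    """Approximate steady-state visiting probabilities of a walker restarting at `seeds`.

    `network` maps every node to the set of its neighbors; it is assumed to be
    symmetric and to list every node as a key, with no isolated nodes.
    """
    nodes = sorted(network)
    seeds = set(seeds)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {v: restart * p0[v] for v in nodes}  # restart mass returns to seeds
        for u in nodes:
            neighbors = network[u]
            if not neighbors:
                continue
            share = (1.0 - restart) * p[u] / len(neighbors)
            for v in neighbors:  # remaining mass moves to neighbors uniformly
                nxt[v] += share
        p = nxt
    return p


def rank_candidates(network, seeds, candidates, restart=0.3):
    """Rank candidates by decreasing visiting probability."""
    p = random_walk_with_restart(network, seeds, restart)
    return sorted(candidates, key=lambda gene: (-p[gene], gene))
```

Unlike a direct-neighbor count, the resulting score decays smoothly with network distance, so a candidate two steps from several disease genes can still rank above an unrelated gene.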
Although a few “top-ranking genes” from prioritization methods have been experimentally validated, the order or ranks of candidate genes are almost impossible to confirm and hard to interpret biologically, which makes it difficult to evaluate the overall performance of prioritization methods. Moreover, a network of several genes with small effects may have a stronger effect than the top-ranking gene. Thus, results from prioritization should be interpreted carefully.
Pathway-Based Disease Classification
Accurate classification of diseases and disease stages is important for understanding of the underlying mechanism and design of efficient treatment. Gene expression profiling has been applied to identify cancer subtypes and predict treatment outcomes for over a decade [83, 84, 85, 86, 87]. In those early studies, genes are typically selected by their power to discriminate between different classes of disease without acknowledging the fact that genes are functioning by coordinately interacting with each other. The performance of those methods was not satisfactory, and the selected gene sets from different studies have limited overlap, even for the same cancer [84, 86], which is likely due to the genetic heterogeneity across patients and dysregulation at the pathway level instead of the gene level. Pathway- and network-based methods have been developed to improve the classification and cope with these issues.
Nevins and his colleagues developed pathway-based methods to detect cancer subtypes [88, 89••, 90]. Their approach identified gene expression signatures that reflect the activation status of several oncogenic pathways, and detected cancer subtypes using these signatures. To identify the expression signatures, human mammary epithelial cells were first infected with adenovirus expressing a specific oncogene, such as Myc, Ras, or Src. Then, the activation status of each oncogenic pathway was measured, and gene expression signatures that reflected the activity of a given pathway were selected. Finally, the signatures were used to detect cancer subtypes. The results showed that patients assigned to the same subtype share similar clinical and biological properties [89••].
Ideker and colleagues proposed a method to identify subnetworks that correlated with cancer metastasis [37, 91]. Their method integrated PPI networks with gene expression profiling from metastatic or non-metastatic cancer cells. For one given subnetwork, MI was calculated to detect the correlation between expression profiling and metastasis. The subnetwork with optimal MI was searched using a greedy algorithm. Permutation was used to test the statistical significance of the subnetwork. The results showed that network-based methods achieve higher accuracy and are more reproducible than alternative approaches. This approach has been extended to integrate the proteins that were differentially expressed in colon cancer from proteomics experiments.
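The greedy subnetwork search described above can be sketched as follows. Several details are assumptions made for illustration: subnetwork activity is taken as the sum of per-gene z-scores divided by the square root of the module size, activity is discretized at its median before MI is computed, and the search stops when no neighbor improves the score. The permutation test for significance is omitted.

```python
from collections import Counter
from math import log2, sqrt
from statistics import median


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def module_score(genes, expr, labels):
    """MI between the discretized activity of a gene set and the class labels."""
    genes = sorted(genes)
    n_samples = len(labels)
    activity = [
        sum(expr[g][i] for g in genes) / sqrt(len(genes)) for i in range(n_samples)
    ]
    cut = median(activity)  # assumed discretization: above vs. not above the median
    return mutual_information([a > cut for a in activity], labels)


def greedy_subnetwork(seed, network, expr, labels, min_gain=1e-9):
    """Grow a module from `seed`, adding the neighbor that most improves the score."""
    module = {seed}
    best = module_score(module, expr, labels)
    while True:
        frontier = {n for g in module for n in network.get(g, ())} - module
        trials = [
            (module_score(module | {n}, expr, labels), n)
            for n in sorted(frontier)
            if n in expr
        ]
        if not trials:
            break
        top_score, top_gene = max(trials)
        if top_score - best <= min_gain:
            break
        module.add(top_gene)
        best = top_score
    return module, best
```

Because the search is greedy, the returned module is a local optimum that depends on the seed; running it from every gene and comparing scores against permuted labels would be the natural next step.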
Many novel methods for pathway analysis have been developed and applied to many aspects of biomedical research to understand the underlying mechanisms of diseases. The pathway-based approach outperforms previous methods because it is based on the activity of biologically connected and validated gene sets rather than on the expression levels of individual genes. The methods described above, which integrate genome-wide expression or GWAS data with pathways and networks, are very promising, but they can be improved by taking into account other information, such as epigenetics. However, the field is still far from maturity due to incomplete pathway knowledge. Furthermore, pathway analysis is currently centered on protein-coding genes, and non-protein-coding elements (noncoding RNA, non-transcribed regions, and epigenetic marks) have not been sufficiently integrated into the analysis. Recent studies have suggested that 80% of the human genome might be functional, and epigenetics plays an important role in maintaining proper cellular functions [62, 94, 95, 96]. As the cost of HT data acquisition keeps decreasing dramatically, genomic, epigenomic, and ultimately proteomic data from biomedical research will accumulate even more rapidly. This will accelerate the integration of information from coding and non-coding regions to significantly improve pathway analysis.
This publication was made possible in part by the Clinical and Translational Science Collaborative of Cleveland, UL1TR000439 from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH roadmap for Medical Research and in part through support from the National Cancer Institute (P30-CA-043703), and the National Institute for Allergy and Infectious Diseases (P30-AI-036219).
Conflict of Interest
Y. Liu and M. R. Chance declare no conflicts of interest.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by any of the authors.
Papers of particular interest, published recently, have been highlighted as: • Of importance; •• Of major importance
- 15. Liu M, Liberzon A, Kong SW, et al. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007;3(6):958–72.
- 30. •• Liu Y, Koyuturk M, Barnholtz-Sloan JS, Chance MR. Gene interaction enrichment and network analysis to identify dysregulated pathways and their interactions in complex diseases. BMC Syst Biol. 2012;6:65. This study introduces mathematical measures for dysregulated interactions and methods to identify them.
- 42. • Califano A, Butte AJ, Friend S, Ideker T, Schadt E. Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat Genet. 2012;44(8):841–7. This article presents some examples of integrating network and other “omics” data for disease association studies.
- 46. •• Liu Y, Maxwell S, Feng T, et al. Gene, pathway and network frameworks to identify epistatic interactions of single nucleotide polymorphisms derived from GWAS data. BMC Syst Biol. 2012;6:S15. This study presents four frameworks for efficiently identifying interactions between SNPs associated with diseases.
- 74. • Gillis J, Pavlidis P. “Guilt by Association” is the exception rather than the rule in gene networks. PLoS Comput Biol. 2012;8(3):e1002444. This study shows that functional information within networks is typically concentrated in only a small region of the network, and “guilt by association” cannot be applied across the whole network.
- 77. •• Erten S, Bebek G, Koyuturk M. VAVIEN: An algorithm for prioritizing candidate disease genes based on topological similarity of proteins in interaction networks. J Comput Biol. 2011;18(11):1561–74. This study presents a method to prioritize genes based on topological properties instead of “guilt by association”.
- 91. Chuang FY, Rassenti LZ, Salcedo M, et al. Subnetwork-based analysis of chronic lymphocytic leukemia identifies pathways that associate with disease progression. Blood. 2011;118(21):1521–2.