Background

Pathway analysis and gene-set enrichment analysis are both widely used methods to identify significant molecular expression patterns from high-throughput data [1]. Over the last decade, biological pathways have provided natural sources of molecular mechanisms to develop diagnosis, treatment, and prevention strategies for complex diseases [2-4]. Massive and varied functional genomics data are analyzed more effectively by gene-set enrichment methods than by individual gene analysis [5-8]. Pathway analysis and molecular signature discovery continue to reveal the associations between genotypes and phenotypes, which are simply called molecular profiling or molecular phenotypes. At present, researchers intend to combine pathway and gene-set enrichment approaches with network module-based approaches to identify crucial relationships among different molecular mechanisms [1].

As sources of prior knowledge for molecular mechanisms, biological pathway databases are heterogeneous, cross multiple levels, and lack annotations [3]. Different pathway databases may yield divergent results from the same input data. When different databases yield similar results, applying multiple pathway data sources in a single analysis can generate a measure of validation. Unlike candidate pathway analysis, genome-wide pathway analysis does not require prior biological knowledge. In addition, genome-wide pathway analysis can reveal gene interactions across different diseases [3, 9] and multiple pathways [3, 10, 11]. Other studies based on an online integrated human pathway database (HPD) also provided associations between different pathways with diverse types, sizes, and sources [12, 13] on specific phenotypes. Although these efforts have greatly improved the efficiency of pathway analysis, our knowledge of biological pathways is still far from complete.

Gene signature data from the transcriptome level offers a complementary source of information to complete pathway knowledge. In a recent review, Khatri et al. [1] categorized pathway analysis into three generations of approaches: the first-generation "over-representation analysis" (ORA) approaches, the second-generation "functional class scoring" (FCS) approaches, and the third-generation "pathway topology" (PT) approaches. To overcome the limitations of ORA approaches (gene-level statistics), FCS approaches, such as gene-set enrichment analysis (GSEA) [6], were devised to include overall changes of gene expressions in each pathway/gene set (pathway-level statistics). Third generation approaches also include overall changes of gene expressions based on pathway topology--that is, their upstream/downstream positions within each pathway. Although these third generation approaches were meant to change our understanding of the underlying mechanisms of pathways, they lack information necessary to achieve this: the interdependence between pathways. Annotated knowledge from genome, transcriptome, post-transcriptome, and proteome levels can assist pathway and gene-set enrichment analysis.

Multi-level, multi-scale, knowledge-guided enrichment analysis can enable molecular phenotype discovery for specific human diseases. Currently, the acquisition of prior knowledge and systems modeling pose challenges for developing tools that go beyond third-generation pathway analysis for disease-specific molecular profiling. Prior knowledge acquisition requires keeping up with updates and improving the available annotations with descriptive knowledge from multiple levels, especially information on the pathway microenvironment ("condition-, tissue-, and cell-specific functions of each gene") [1, 3]. Systems modeling must incorporate data from a systems biology perspective to build multi-scale models, which can be used to generate hypotheses that give detailed and accurate predictions of changes in those systems. Both aspects of this challenge will be addressed by building a database that not only contains disease-associated genes, transcription factors, proteins, and microRNAs, but also organizes their relationships within and between pathways, gene signatures, and any gene sets from existing experiments or papers.

To meet the new challenges of molecular phenotype discovery, we developed in this work an integrated online database, the Pathway And Gene Enrichment Database (PAGED), to enable comprehensive searches for disease-specific pathways, gene signatures, microRNA targets, and network modules, by integrating gene-set-based prior knowledge as molecular patterns from multiple levels: the genome, transcriptome, post-transcriptome, and proteome. The new database can provide the following benefits to biological researchers. First, it includes disease-gene association data, curated and integrated from the Online Mendelian Inheritance in Man (OMIM) [14] database and the Genetic Association Database (GAD) [15]; therefore, it has the potential to assist human disease studies. Second, as of March 2012 it also contains all currently compiled gene signatures in the Molecular Signatures Database (MSigDB) [8] and the Gene Signatures Database (GeneSigDB) [7]. Third, it further integrates microRNA targets from the miRecords [16] database, signaling pathways, protein interaction networks, and transcription factor/gene regulatory networks, partially based on data integrated from the Human Pathway Database (HPD) [12] and the Human Annotated and Predicted Protein Interaction (HAPPI) [17] database. All gene sets or pathways are annotated with molecular interaction details whenever available. We integrated the following versions of the databases: OMIM [14] (Feb. 2012), GAD [15] (Aug. 2011), GeneSigDB [7] (v. 4.0, Sept. 2011), MSigDB [8] (v. 3.0, Sept. 2010), HPD [12] (2009), HAPPI [17] (v. 1.4), and miRecords [16] (Nov. 2010), which are the latest versions available. An advantage of our work lies in its representation of relationships between pathways, gene signatures, microRNA targets, and/or network modules. These gene-set-based relationships can be visualized as a gene-set association network (GSAN), which provides a "roadmap" for molecular phenotype discovery for specific human diseases.
Using colorectal cancer expression data analysis as a case study, we demonstrate how to query PAGED to discover crucial pathways, gene signatures, and gene network modules specific to colorectal cancer functional genomics.

Methods

Data sources

We show an overview of the data integration process in Figure 1. Gene-set data were collected, extracted, and integrated from five major categories. The pathway data were from HPD [12], which has integrated 999 human biological pathways from five curated sources: KEGG, PID, BioCarta, Reactome, and Protein Lounge. The genome-level disease-gene relationships were from OMIM [14] and GAD [15]; the transcriptome-level gene signatures were from MSigDB [8] and GeneSigDB [7]; the post-transcriptome-level microRNA data were from miRecords [16]; and the proteome-level data were from the integrated protein interaction database HAPPI [17], which has integrated the HPRD, BIND, MINT, STRING, and OPHID databases.

Figure 1

An overview of the gene-set data integration process. The figure shows the whole process of gene-set data integration and the basic statistics applied to gene-set data sources.

Gene-set data integration

We treat all groups of genes as gene sets, including disease-associated genes, pathway genes, gene signatures, microRNA-targeted genes, and PPI sub-network modules. The raw files from these data sources have various formats, including plain text, XML, and tables. We wrote Perl/Java parsers to convert them into a common tab-delimited text format to ensure syntactic-level data compatibility. To integrate across different databases, we mapped the gene/protein IDs in all databases to official gene symbols. The gene-set data are stored in our backend Oracle 11g relational database. As of the current release, PAGED contains a total of 25,242 gene sets, 61,413 genes, 20 organisms, and 1,275,560 records. All gene-set members are represented by official gene symbols, and all PAGED gene sets are assigned unique PAGED-specific identifiers.

Online software design

The PAGED platform follows a multi-tiered design architecture. The backend was implemented as PL/SQL packages on an Oracle 11g database server and the PAGED application middleware was implemented on the Oracle Application Express (APEX) server, which bridged between the Apache webserver and the Oracle database server.

Gene-set similarity measurement

Referring to the pathway similarity definition introduced in [12], the similarity score $S_{i,j}$ of two different gene sets is defined by the following formula:

$$S_{i,j} = \alpha \times S_L + (1 - \alpha) \times S_R = \alpha \times \frac{|P_i \cap P_j|}{|P_i \cup P_j|} + (1 - \alpha) \times \frac{|P_i \cap P_j|}{\min\{|P_i|, |P_j|\}}, \quad (i \neq j) \tag{1}$$

Here, $P_i$ and $P_j$ denote two different gene sets, while $|P_i|$ and $|P_j|$ are the numbers of genes in these two gene sets. Their intersection $P_i \cap P_j$ denotes the set of genes they share, while the size of their union, $|P_i \cup P_j|$, is calculated as $|P_i| + |P_j| - |P_i \cap P_j|$. Here, $\alpha$ is a weight coefficient in [0, 1] that balances the contributions of the overlap term (the left term, $S_L$) and the cover term (the right term, $S_R$). $S_L$ is well known as the Jaccard coefficient [18], which is often used to evaluate the similarity between two sets [19]. When a larger gene set covers a smaller one, we expect their similarity score to be high enough to identify them. In this situation, although $S_L$ is small, $S_R$ equals 1.0, which raises the final similarity score in Equation (1) for an appropriate value of $\alpha$. Additional file 1 shows how different values of $\alpha$ affect the distribution of the similarity scores of all cancer-related gene sets. We found that when $\alpha$ fell in the interval [0.7, 0.9], the score distribution was close to a Poisson distribution. As we know, a Poisson distribution expresses the probability of a number of events occurring during a fixed period of time if these events occur with a known average rate and independently of the time since the last event. Therefore, we chose the middle value, $\alpha = 0.8$, for the rest of the analysis. Our previous HPD paper [12] also supports the choice of 0.8 for the pathway similarity measurement.
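As an illustration of Equation (1), the following minimal Python sketch computes the similarity score for two gene sets. The function name `similarity`, the handling of empty sets, and the toy gene symbols are assumptions made for this example; they are not taken from the PAGED implementation, which runs as PL/SQL on the database server.

```python
def similarity(set_i, set_j, alpha=0.8):
    """Weighted gene-set similarity following Equation (1).

    alpha weights the Jaccard (overlap) term S_L against the
    cover term S_R; alpha = 0.8 is the value chosen in the text.
    """
    a, b = set(set_i), set(set_j)
    if not a or not b:
        # Assumption for this sketch: an empty gene set has zero similarity.
        return 0.0
    shared = len(a & b)
    s_l = shared / len(a | b)           # overlap term (Jaccard coefficient)
    s_r = shared / min(len(a), len(b))  # cover term (smaller set covered)
    return alpha * s_l + (1 - alpha) * s_r


# Toy example: the larger set fully covers the smaller one.
small = {"APC", "TP53"}
large = {"APC", "TP53", "KRAS", "SMAD4"}
print(similarity(small, large))  # S_L = 0.5, S_R = 1.0, so roughly 0.6
```

In this toy case the Jaccard term alone would give 0.5, while the cover term lifts the score to about 0.6, which is the behavior the cover term is intended to provide.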

Microarray data

Here we use colorectal cancer (CRC) expression data analysis as a case study to show how to discover crucial pathways, gene signatures, and gene network modules specific to colorectal cancer functional genomics. We downloaded the colorectal cancer microarray dataset GSE8671 from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) [20]. This microarray dataset compared the transcriptome data of 32 prospectively collected adenomas with those of the normal mucosa from the same individuals. Hence, we have 32 CRC samples and 32 normal samples. When several probe IDs mapped to the same protein, we used the maximal expression value. We used the Affy package in BioConductor for quantile normalization, the built-in MicroArray Suite (MAS5) for background correction, and Limma in BioConductor for differential analysis, the result of which is represented as fold changes (FC) of CRC samples vs. normal samples.
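The probe-collapsing step described above (keeping the maximal expression value when several probe IDs map to the same protein) can be sketched as follows. This is only a Python illustration; the analysis itself was carried out with BioConductor packages. The function name `collapse_probes` and the probe identifiers are hypothetical.

```python
def collapse_probes(probe_values, probe_to_gene):
    """Keep the maximal expression value among probes mapping to one gene.

    probe_values:  dict mapping probe ID -> expression value
    probe_to_gene: dict mapping probe ID -> gene symbol
    Probes without a mapping are skipped (an assumption of this sketch).
    """
    gene_values = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene not in gene_values or value > gene_values[gene]:
            gene_values[gene] = value
    return gene_values


# Hypothetical probe IDs, for illustration only.
values = {"probe_1": 7.2, "probe_2": 9.1, "probe_3": 5.0, "probe_4": 6.3}
mapping = {"probe_1": "APC", "probe_2": "APC", "probe_3": "TP53"}
print(collapse_probes(values, mapping))
```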

Differential gene-set expressions

We use ABS_FC to denote the absolute value of fold change for each gene. We then define differential gene-set expressions here as

NORM_ABS_FC: The p*-norm of ABS_FC of all the available differential gene expressions in a gene set.

Usually,

$$p\text{-norm} = \left( \sum_{i=1}^{n} (x_i)^p \right)^{1/p}$$

For unification, we modify it as

$$p^{*}\text{-norm} = \left( \frac{1}{n} \sum_{i=1}^{n} (x_i)^p \right)^{1/p} \tag{2}$$

In the implementation, p = 6 performs the best at accentuating highly differential expressions in a gene set.
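A minimal Python sketch of Equation (2) is given below. The function name `norm_abs_fc` and the example fold-change values are assumptions made for illustration. The sketch takes the absolute value of each input so that it can be applied directly to signed fold changes.

```python
def norm_abs_fc(fold_changes, p=6):
    """p*-norm of absolute fold changes, following Equation (2).

    The mean of the p-th powers is taken before the p-th root, so the
    result does not grow with the number of genes in the gene set.
    """
    values = [abs(x) for x in fold_changes]
    if not values:
        raise ValueError("gene set has no differential expression values")
    mean_power = sum(v ** p for v in values) / len(values)
    return mean_power ** (1.0 / p)


# One strongly changed gene dominates the score when p = 6.
print(norm_abs_fc([1.0, 1.0, 4.0]))
# With identical values, the p*-norm is approximately that value.
print(norm_abs_fc([2.0, 2.0, 2.0]))
```

With p = 6, the first toy gene set scores well above its arithmetic mean of 2.0 but below its maximum of 4.0, which illustrates how a large p accentuates highly differential genes.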

Gene-set association network (GSAN) construction

To visualize the relationships between gene sets, we define a gene-set association network (GSAN) as a network of associations between different gene sets, in which the network element representation is as follows:

  • Node: Gene set

  • Edge: Association between two gene sets

  • Node size: Gene-set scale (Counting genes in each gene set)

  • Node color: Differential gene-set expression (NORM_ABS_FC)

  • Node line color: Gene-set data source

  • Edge width: Similarity score
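The network elements listed above can be assembled as in the following Python sketch. The names `build_gsan`, `min_similarity`, and the dictionary keys are hypothetical; the threshold of 0.1 follows the edge filter stated in the figure legends, and the similarity function restates Equation (1) with α = 0.8 so that the block is self-contained.

```python
from itertools import combinations


def similarity(a, b, alpha=0.8):
    """Gene-set similarity of Equation (1) for two non-empty sets."""
    shared = len(a & b)
    if shared == 0:
        return 0.0
    return (alpha * shared / len(a | b)
            + (1 - alpha) * shared / min(len(a), len(b)))


def build_gsan(gene_sets, sources, expression, min_similarity=0.1):
    """Return (nodes, edges) describing a gene-set association network.

    gene_sets:  dict mapping gene-set name -> set of gene symbols
    sources:    dict mapping gene-set name -> data source (node line color)
    expression: dict mapping gene-set name -> NORM_ABS_FC (node color)
    """
    nodes = {
        name: {
            "size": len(genes),                  # gene-set scale
            "color": expression.get(name),       # differential expression
            "line_color": sources.get(name),     # data source
        }
        for name, genes in gene_sets.items()
    }
    edges = []
    for (name_i, genes_i), (name_j, genes_j) in combinations(gene_sets.items(), 2):
        score = similarity(genes_i, genes_j)
        if score >= min_similarity:
            edges.append((name_i, name_j, score))  # edge width = score
    return nodes, edges


# Toy gene sets, for illustration only.
sets = {
    "set_A": {"APC", "TP53", "KRAS"},
    "set_B": {"APC", "TP53"},
    "set_C": {"BRCA1"},
}
nodes, edges = build_gsan(sets, {"set_A": "KEGG"}, {"set_A": 2.5})
print(nodes["set_A"], edges)
```

The resulting node and edge records could then be exported to a tool such as Cytoscape for layout and styling.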

Results

Database content statistics

Table 1 lists the detailed statistics for each data source and the overlap between each pair. For example, MSigDB contains 30,525 genes and GeneSigDB contains 36,791 genes. The number of overlapping genes between these two databases is 17,209. We found a synergistic effect from integrating these two signature databases, resulting in greatly increased gene-set coverage. The same effect was observed for all the remaining pair comparisons. These data sources proved to be complementary.

Table 1 Number of overlapping genes between different data sources

Gene-set scale distributions

The gene-set scale can reflect the integrality of information content of a biological topic. In this study, we define gene-set scale as the number of molecules (i.e., gene symbols) in a gene set. We performed a statistical analysis of the gene-set scale distributions of both PAGED and of its individual data sources. Figure 2 shows that each data source taken by itself is not very scale-free, especially for OMIM, GAD, and miRecords for higher scales and HPD, GeneSigDB, and MSigDB for lower scales. The scale distribution of PAGED is relatively scale-free on both the low end and the high end with a linear regression R-squared of 0.88. Additionally, the distribution of PAGED always lies above those of its data sources, indicating that the integration has enriched the number of gene sets in all scales instead of exhibiting a bias towards one particular scale. These observations indicate that the integration process of PAGED has resulted in a database that can take account of different gene set scales.
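The scale-distribution fit described above can be reproduced in outline as follows: count how many gene sets have each size, then fit a straight line to the counts on log-log axes and report R-squared. This is a sketch under the assumption of an ordinary least-squares fit; the function name `scale_distribution_fit` and the toy sizes are hypothetical.

```python
from collections import Counter
from math import log10


def scale_distribution_fit(gene_set_sizes):
    """Least-squares line through (log10 size, log10 frequency).

    Returns (slope, intercept, r_squared). Requires positive sizes,
    at least two distinct sizes, and frequencies that are not all equal.
    """
    counts = Counter(gene_set_sizes)
    xs = [log10(size) for size in counts]
    ys = [log10(freq) for freq in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy data following an exact power law: 100 sets of size 1,
# 10 sets of size 10, and 1 set of size 100.
sizes = [1] * 100 + [10] * 10 + [100]
print(scale_distribution_fit(sizes))
```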

Figure 2

Gene-set scale distributions for PAGED molecule data. A gene-set scale refers to the number of molecules (i.e., genes) involved in a given gene set. The frequency on the y-axis refers to the count of all gene sets falling into the category of a particular gene-set scale size on the x-axis. The distributions are plotted under log scale for both the x-axis and y-axis. The linear trend line in red represents linear regression of PAGED distribution and the linear equation and its R-Square are listed.

Online functionalities

In Figure 3, we show the user interfaces of the PAGED website. It supports both disease-based search and user-defined gene-list search. If users search a disease term on the home page (Figure 3A), PAGED will retrieve a list of related gene sets by directly matching the disease term with all the gene-set names; if users instead search a disease term on the advanced search page (Figure 3D), PAGED will first retrieve disease-relevant genes from OMIM and GAD and then use those genes to query the whole database, which will retrieve a gene-set list based on disease gene profiles that is more comprehensive than that of either OMIM or GAD individually. Users can also search PAGED using multiple genes on the home page (by delimiting them with commas) to retrieve a list of related gene sets with the number of hits and similarity scores (Figure 3A). In addition, users can upload a file of their genes with one gene per line on the advanced search page (Figure 3D) to perform the gene-based search. In the advanced gene-based search (Figure 3D), users can also perform an organism-specific search, though the majority of the gene sets are human related. All the gene sets are hyperlinked to the original database, where users can further examine the detailed annotations of that specific gene set.
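The gene-list search can be pictured with the following Python sketch, which parses a comma-delimited query, counts hits against each gene set, and ranks the results. It is an illustration only: the function name `search_gene_sets`, the ranking order, and the use of Equation (1) with α = 0.8 to score the query against each gene set are assumptions, and the actual PAGED queries are executed on the database server.

```python
def search_gene_sets(query, gene_sets, min_hits=1, alpha=0.8):
    """Rank gene sets by the number of query genes they contain.

    query:     comma-delimited string of gene symbols
    gene_sets: dict mapping gene-set name -> set of gene symbols
    Returns a list of (name, hits, similarity) tuples.
    """
    genes = {g.strip() for g in query.split(",") if g.strip()}
    results = []
    if not genes:
        return results
    for name, members in gene_sets.items():
        hits = len(genes & members)
        if hits == 0 or hits < min_hits:
            continue
        score = (alpha * hits / len(genes | members)
                 + (1 - alpha) * hits / min(len(genes), len(members)))
        results.append((name, hits, score))
    # Most hits first, then higher similarity, then name for stable output.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results


# Toy database, for illustration only.
toy_sets = {
    "set_A": {"APC", "TP53", "KRAS"},
    "set_B": {"APC"},
    "set_C": {"BRCA1"},
}
for row in search_gene_sets("APC, TP53", toy_sets):
    print(row)
```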

Figure 3

An overview for the core functionality of the online PAGED website. (A) The PAGED home page providing search by either disease name or gene list; (B) a webpage containing the list of gene sets retrieved as a result of a disease query; (C) a webpage containing the list of gene sets retrieved as a result of a gene list query; (D) an advanced search page in which the user can either search disease name or upload a gene-list to search; (E) a browse page listing the gene sets, their data source and number of genes.

Upon executing the queries, PAGED can retrieve a list of related gene sets in an HTML table (Figure 3B, C) with their specific organism information included, which are downloadable as a comma-separated value (CSV) file. On the same page, there are links for downloading all the genes in those gene sets and the association between each gene set. In the gene set association downloading page, a simple heat map is provided for the visualization of gene set similarities. More sophisticated visualization will be provided in the near future.

Case studies

The following case studies use colorectal cancer expression data analysis to demonstrate how to discover crucial pathways, gene signatures, and gene network modules specific to colorectal cancer functional genomics.

Case study I: Searching disease-associated gene sets based on gene-set names

Using the standard query box provided on the PAGED home page, one can search for colorectal cancer in all biological gene sets. PAGED returns a list of gene sets, which can be ordered by decreasing number of genes contained in each gene set. In total, 45 gene sets from three data sources (i.e., OMIM, GAD, and KEGG) were retrieved. Not surprisingly, most of them are disease-related gene sets from either OMIM or GAD. Only 1 (i.e., "Colorectal cancer pathway") out of 45 is from KEGG. The top 10 search results are listed in Table 2.

Table 2 Top 10 search results by querying colorectal cancer at the home page

Case study II: Searching disease-associated gene sets based on gene-set components

Next, a user can search with the same term colorectal cancer on the advanced search page, which uses the disease's gene profile to search for gene sets. PAGED first obtained 203 colorectal cancer related genes from OMIM and GAD. Then, it used those genes to retrieve a total of 4,932 gene sets with at least 2 hits. Since we are more interested in gene sets other than disease terms, we excluded the gene sets from OMIM and GAD from further analysis. To rule out the possibility that those gene sets were hit randomly, we performed a Fisher's exact test to calculate the p-value between those 203 genes and every retrieved gene set. Finally, we obtained 3,879 gene sets with a p-value < 0.05 and hits ≥ 2. These gene sets are from all data sources, including MSigDB, GeneSigDB, miRecords, and all pathway data sources from HPD. Both the number of gene sets and their variety support the conclusion that the advanced disease search based on gene profiles is more comprehensive than a simple disease search.
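The enrichment test can be sketched in Python as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The text does not state the background gene universe or whether a one- or two-sided test was used, so the `background_size` parameter, the one-sided form, and the example numbers below are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(hits, query_size, set_size, background_size):
    """One-sided Fisher's exact test p-value, P(X >= hits).

    X is the number of query genes falling in a gene set of `set_size`
    genes when `query_size` genes are drawn from `background_size` genes.
    """
    max_hits = min(query_size, set_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 203 query genes, a gene set of 150 genes with
# 12 hits, and an assumed background of 20,000 genes.
p = enrichment_p_value(hits=12, query_size=203, set_size=150, background_size=20000)
print(p < 0.05)
```

Python integers have arbitrary precision, so the binomial coefficients are computed exactly before the final division.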

Table 3 shows the top results ranked by decreasing number of hits from each data source. Protein Lounge suggests "Molecular Mechanisms of Cancer," "Akt Signaling," and other important pathways in colorectal cancer; BioCarta suggests "wnt signaling pathway"; and NCI Nature curated suggests "Canonical Wnt signaling pathway." These are all very important pathways in colorectal cancer development [21]. Similarly, "Colorectal cancer" and "p53 signaling pathway" from KEGG, "SIGNAL_TRANSDUCTION" and "KEGG_PATHWAYS_IN_CANCER" from MSigDB, and cancer-related signatures/microRNA from GeneSigDB/miRecords from Table 3 reveal a comprehensive picture of the important gene sets involved in colorectal cancer. Thus, the results of the advanced search yield more insights about colorectal cancer mechanisms than those of the simple search.

Table 3 Top search results of colorectal cancer advanced search

Case study III: Searching gene sets similar to user-defined query gene sets

To use the gene-based search in PAGED, we first analyzed the colorectal cancer microarray dataset GSE8671 with BioConductor to identify the differential genes. We selected the top 100 genes ranked by absolute fold change with p-values less than 0.05. After querying PAGED with those 100 genes, we obtained 1,707 gene sets, out of which 1,152 also satisfied Fisher's exact test with a p-value less than 0.05. Those gene sets span all the data sources except BioCarta and miRecords. Table 4 lists the top results ranked by the number of hits. Most of them are cancer-related gene sets. Specifically, "SABATES_COLORECTAL_ADENOMA_DN" and "SABATES_COLORECTAL_ADENOMA_UP" from MSigDB and "Intestine_Vecchi07_1024genes" and "Colon_Kim04_235genes" from GeneSigDB support the importance of those 100 query genes to colorectal cancer. This case study also shows the complementary nature of MSigDB and GeneSigDB and thus the benefit of integrating them, which has also been shown in [22].
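The gene selection step (top 100 genes by absolute fold change among those with p < 0.05) can be sketched as follows. The function name `top_differential_genes`, the tuple layout of the input, and the toy values are assumptions for this illustration; the differential analysis itself was performed with Limma.

```python
def top_differential_genes(results, n=100, max_p=0.05):
    """Select the n genes with the largest absolute fold change.

    results: iterable of (gene_symbol, fold_change, p_value) tuples
    Only genes with p_value < max_p are considered.
    """
    significant = [(gene, fc) for gene, fc, p in results if p < max_p]
    significant.sort(key=lambda item: abs(item[1]), reverse=True)
    return [gene for gene, _ in significant[:n]]


# Toy differential analysis output, for illustration only.
toy_results = [
    ("GENE_A", 3.2, 0.001),
    ("GENE_B", -5.1, 0.010),
    ("GENE_C", 6.0, 0.200),   # large change but not significant
    ("GENE_D", 1.1, 0.030),
]
print(top_differential_genes(toy_results, n=2))
```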

Table 4 Top search results of gene-based search from microarray datasets

Case study IV: Building disease-specific gene-set association networks (GSANs) based on gene-set similarities

With the unique top 50 gene sets related to colorectal cancer from disease search and gene search (Table 3 and Table 4), we next investigated the gene-set associations between them; 863 associations were found by overlapping the gene symbols between each pair of gene sets, out of which 642 also satisfied Fisher's exact test of a p-value and FDR less than 0.05. A network visualization using Cytoscape [23] is shown in Figure 4. Most of those gene sets are connected to one another, and a few share a large number of genes, suggesting that they form a collaborative unit in colorectal cancer.
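The text does not specify how the FDR was computed. As one common choice, the following Python sketch applies the Benjamini-Hochberg procedure to a list of p-values; the use of this particular procedure and the function name `benjamini_hochberg` are assumptions, not a description of the analysis actually performed.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four gene-set pairs.
p_vals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(p_vals)
kept = [i for i, (p, q) in enumerate(zip(p_vals, fdr)) if p < 0.05 and q < 0.05]
print(kept)
```

An association would then be retained only if both its p-value and its adjusted value fall below 0.05, mirroring the filter described above.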

Figure 4

CRC-specific gene-set association network (GSAN) on the top gene sets from colorectal cancer study. Node size: Gene-set scale (Counting genes in each gene set); Node color: Gene-set data source; Edge width: Similarity score (≥ 0.1). All gene sets are highly connected to each other, suggesting their collaborative functions in colorectal cancer.

Case study V: Prioritizing disease-associated gene sets by using differential gene-set expressions

First, the differential gene expression value (ABS_FC) for each gene in a gene set is calculated from the differential analysis based on the microarray data GSE8671. Second, the differential gene-set expression value (NORM_ABS_FC) for each gene set in the CRC-specific GSAN is calculated by using Equation (2). Third, a CRC-specific GSAN with differential gene-set expressions is shown in Figure 5, in which node size represents gene-set scale (the number of genes in each gene set); node color represents differential gene-set expression (NORM_ABS_FC); node line color represents the gene-set data source; and edge width represents the similarity score. By considering the differential gene-set expression of each gene set, we prioritize the top-selected gene sets as shown in Table 5. Most of the top-ranked gene sets are closely related to colon tissue, colorectal cancer, or other cancers, which suggests that our database can support not only comprehensive disease-associated gene-set searching and browsing, but also accurate, disease-specific gene-set prioritization by using the concept of differential expression at the gene-set level.

Figure 5

CRC-specific gene-set association network (GSAN) with differential gene-set expressions. The differential gene expressions are from the differential analysis based on the microarray data, GSE8671. Node size: Gene-set scale (Counting genes in each gene set); Node color: Differential gene-set expression (NORM_ABS_FC); Node line color: Gene-set data source; and Edge width: Similarity score (≥ 0.1).

Table 5 Top 20 gene sets ranked by differential gene-set expressions in the CRC-specific gene-set association network (GSAN)

Discussion

In the near future, we will improve the gene-set similarity algorithms by using a global PPI network to calculate the distance between gene sets, which would provide a more robust measurement. For web interface development, we plan to add a disease browsing function based on disease ontology and a network visualization function to show the gene-set associations dynamically. Our final goal is to perform multi-scale network modeling for molecular phenotype discoveries by integrating differential expressions with pathway and network topologies. The current release of PAGED provides a solid foundation for us to develop third-generation pathway analysis tools [1].

Conclusions

We developed PAGED, an online database that provides the most comprehensive public compilation of gene sets. In the current release, PAGED contains a total of 25,242 gene sets, 61,413 genes, 20 organisms, and 1,275,560 records from five major categories: the pathway data from HPD, genome-level disease data from OMIM and GAD, transcriptome-level gene signatures from MSigDB and GeneSigDB, the post-transcriptome microRNA data from miRecords, and proteome-level data from HAPPI. The numbers of overlapping genes between data sources, the gene-set scale distribution, and the colorectal cancer case studies show the synergistic effect of integrating data sources, which greatly facilitates access to gene-set-based prior knowledge. The current PAGED software can help users address a wide range of gene-set-related questions in human disease biology studies.