Abstract
Background
Pathway-oriented experimental and computational studies have led to a significant accumulation of biological knowledge concerning three major types of biological pathway events: molecular signaling events, gene regulation events, and metabolic reaction events. A pathway consists of a series of molecular pathway events that link molecular entities such as proteins, genes, and metabolites. There are approximately 300 biological pathway resources as of April 2009 according to the Pathguide database; however, these pathway databases generally have poor coverage or poor quality, and are difficult to integrate, due to syntactic-level and semantic-level data incompatibilities.
Results
We developed the Human Pathway Database (HPD) by integrating heterogeneous human pathway data that are either curated at the NCI Pathway Interaction Database (PID), Reactome, BioCarta, and KEGG, or indexed from the Protein Lounge Web sites. Integration of pathway data at syntactic, semantic, and schematic levels was based on a unified pathway data model and data warehousing-based integration techniques. HPD provides a comprehensive online view that connects human proteins, genes, RNA transcripts, enzymes, signaling events, metabolic reaction events, and gene regulatory events. At the time of this writing, HPD includes 999 human pathways and 59,341 human molecular entities. The HPD software provides both a user-friendly Web interface for online use and a robust relational database backend for advanced pathway querying. This pathway tool enables users to 1) search for human pathways from different resources by simply entering genes/proteins involved in pathways or words appearing in pathway names, 2) analyze pathway-protein association, 3) study pathway-pathway similarity, and 4) build integrated pathway networks. We demonstrated the usage and characteristics of the new HPD through three breast cancer case studies.
Conclusion
HPD (http://bio.informatics.iupui.edu/HPD) is a new resource for searching, managing, and studying human biological pathways. Users of HPD can search against large collections of human biological pathways, compare related pathways and their molecular entity compositions, and build high-quality, expanded-scope disease pathway models. The current HPD software can help users address a wide range of pathway-related questions in human disease biology studies.
Background
The study of biological pathways has become a central topic in molecular systems biology [1]. While the precise definition of "biological pathway" is still debatable, most researchers regard a biological pathway as a series of inter-connected cellular events among biomolecular entities. A biological pathway can be activated by extracellular stimuli and lead to persistent changes of the biochemical state of cells. There are three major types of molecular pathway events (or, events for brevity) that define biological pathways:
- Signal transduction events. Common in signaling pathways (e.g., the Wnt signaling pathway [2]), these events define the interactions among molecular entities during signal transduction cascades, i.e., how external stimuli such as molecules in the cellular environment are transduced into intracellular molecular signals that are relayed among different cellular organelles. Examples of signal transduction events in signaling pathways are protein-protein interactions, protein post-translational modifications, protein translocations, and protein complex formations/dissociations.
- Enzymatic reaction events. Common in metabolic pathways (e.g., the glycolysis pathway), these events define chemical reactions in which metabolites (as either substrates or products) and catalytic enzymes are involved. Examples of enzymatic reaction events are catabolic reactions (breaking down of larger molecules to produce energy) and anabolic reactions (synthesis of cellular components from smaller molecules).
- Genetic regulation events. Common in genetic regulatory pathways (usually abbreviated as regulatory pathways), these events define the dependency relationships between regulatory entities, e.g., a transcription factor that binds to specific DNA binding motifs, and target entities, e.g., a gene whose transcription is regulated by a transcription factor. In addition to gene regulation events, regulatory pathways may also include the regulation of target genes by small RNAs (sRNAs).
Collecting and modeling biological pathways are critical for interpreting "Omics" data [3]. For example, pathway knowledge has been used to identify new functional modules from gene expression profiles [4, 5] and relate gene mutations to one another in polygenic diseases such as breast cancer [6]. The development of biological pathways can also help build disease biology models, from which new hypotheses of targeted drugs and robust biomarkers may be developed. For example, molecular entities in FGFR1/PI3K/AKT signaling pathways, the Akt/PKB pathway, the Met pathway, and the Wnt signaling pathway have all been extensively investigated as potential cancer drug targets [7–10]. Novel drug discovery strategies to screen small molecules based on an entire pathway instead of particular protein targets can also be developed by designing global disease-related pathway inhibitors [11]. Pathway studies have also shown promise in molecular diagnostic applications, e.g., identifying efficacy and toxicity biomarkers [12], and building new multi-marker panels to improve prediction of disease prognosis and development of treatment plans [13]. Ongoing efforts to represent, develop, and apply pathway models will be crucial for future genome medicine and personalized medicine applications [14, 15].
While there are approximately 300 biological pathway-related online resources reported by Pathguide (http://www.pathguide.org/) today, these resources have been developed with variable degrees of data coverage, quality, and utility [1]. Examples of high-quality biological pathway database resources are: SPAD [16], CST [17], STKE [18] and COPE [19] for signaling pathways; TRANSFAC [20] for regulatory pathways; and KEGG [21], WIT [22], ExPASy [23], UM-BBD [24] and HumanCyc [25] for metabolic pathways. In addition, new databases such as HPRD [26], HAPPI [27], and STRING [28] have been developed to provide available high-throughput protein-protein interaction data to help fill gaps in rapidly growing molecular signaling pathway data. Recent efforts to expand biological pathway coverage beyond a single pathway event type have also been reported, e.g., NCI-PID [29], Reactome [30], BioCarta [31], Pathway Commons [32], Panther [33], Protein Lounge [34] and WikiPathways [35]. However, by comparing the coverage of high-quality protein-protein interactions from the HAPPI database [27] with annotated human pathways documented from the Reactome database, for example, it is not difficult to conclude that current coverage of known human biological pathway events is 1–2 orders of magnitude smaller than the theoretical maximum that can be defined by all known reliable human protein-protein interactions. Therefore, many pathway biology studies begin by expanding biological pathway data coverage and building high-quality integrative pathway models.
The most reliable approach to expanding human pathway data coverage without sacrificing data quality continues to be database integration. While there are several computational techniques that can help predict metabolic pathways [36], regulatory pathways [37, 38], and signaling pathways [39], they all have limited applicability and are thus beyond the scope of this work. However, integrating biological pathways from different data sources has been challenging, due to the heterogeneity in pathway data formats, representation schemes, and retrieval methods. For example, at the syntactic level, while many pathway databases such as the NCI-PID [29], Reactome [30], and KEGG [21] provide both molecular component and molecular interaction data as XML documents, Protein Lounge [34] and BioCarta [31] provide pathway details (including molecular entities and pathway events) only as TXT files and embedded pathway diagrams. Pathway ontology standards such as PSI-MI [40], BioPAX [41], or GPML [42] can help resolve syntactic-level data heterogeneity; however, these standards are relatively new and are available only in a few recent systems such as cPATH [43], NCI-PID [29], Reactome [30] and WikiPathways [35]. At the semantic level, incompatible pathway names, event representations, and molecular entity identifiers also pose challenges in querying pathway information across pathway data sources, particularly those with complementary information. Pathway names from different pathway data sources for the same pathway often differ slightly and therefore are poor choices as identifiers. Identifying pathways directly using pathway molecular entities can also be problematic, because the ensemble of molecular entities referring to the same pathway may vary among different annotation sources. Pathway molecular entities may be referred to with any public sequence identifier, which includes RefSeq ID, HGNC symbol, GenBank accession, SwissProt ID, UniProt name, KEGG ID, or IPI number.
Furthermore, different databases may choose to provide available pathway information at different levels of molecular detail, e.g., with protein post-translational modification status, protein complex association status, or cellular location information. In summary, pathway data incompatibility at both the syntactic and semantic levels has inhibited the growth of high-quality integrative pathway data sources.
In this work, we describe the development of a new online integrated pathway database resource, the Human Pathway Database (HPD). HPD is an ongoing pathway data warehousing project, in which we integrate all three types of human pathway data and compile additional detailed information on pathway genes, proteins, metabolites, protein complexes, and pathway events. The concept of developing an organism-specific integrated pathway database resource is not unique, e.g., MAtDB [44] for managing all biological pathways for Arabidopsis and FlyMine [45] for managing both functional genomics and pathway data for Drosophila. Applying semantic-level data integration techniques, we collect, represent, and manage human-specific pathway data in HPD based on information from NCI-PID, Protein Lounge, KEGG, BioCarta, and Reactome databases. HPD provides a comprehensive view of current human biological pathway data, which consists of a total of 999 pathways and 59,341 molecular entities. Online HPD users may search the database for all relevant pathway information related to query protein(s), identify all pathways involving a query protein(s), and examine details related to pathway components, molecular events, and related pathways. Using three case studies, we show how to take advantage of HPD online and backend database querying capabilities to manage, query, and compare different types of biological pathways for systems biology studies. HPD is freely available online at http://bio.informatics.iupui.edu/HPD.
Results
Database content statistics
By integrating human biological pathway data from five major curated sources, we have developed HPD, a human pathway data warehouse. As of the current release, HPD contains a total of 999 human pathways that cover all three major types of pathway events. These pathways cover 59,341 molecular entities and 16,271 pathway events. As of April 2009, HPD has the highest pathway data coverage among all publicly available human biological pathway databases. Since HPD does not contain new pathways derived computationally, the quality of each pathway remains the same as in its respective source database. A comparison of human pathways in HPD against several common human pathway data sources is shown in Table 1. The top 100 pathways, genes/proteins, and compounds are listed in Additional file 1.
Scale distributions of integrated HPD pathways
Pathway scale can reflect the completeness of information needed for a biological topic. Here, we define pathway scale as the number of entities (nodes, including genes, proteins, complexes, and metabolites) or events (edges, including interactions, reactions, and regulations) involved in a pathway. We performed a statistical analysis of the Pathway Scale Distribution (PSD) across the whole of HPD, shown in Figure 1, from which we can see that the PSD defined by entity in Figure 1a is almost the same as the PSD defined by event in Figure 1b. This result indicates that the ratio of entity (node) number to event (edge) number in HPD pathways is almost fixed, which implies that the quality of HPD is consistent. We also find that the PSD defined by gene/protein in Figure 1a is much closer to the PSD defined by entity or event than are the PSDs defined by metabolite, interaction, reaction, and regulation, which suggests that the gene/protein number represents pathway scale more precisely. This is the most important evidence not only for the operational definition of pathway scale, but also for the definition of pathway-pathway similarity, both of which can be defined by the number of UniProt IDs mapped from the genes or proteins in a pathway.
We also note that, since entities in a pathway include protein complexes, each of which counts as only one entity, the PSD defined by entity is slightly lower than the PSD defined by gene/protein in Figure 1a. Taken together, the results in Figures 1a and 1b suggest that, judged by pathway scale, the integration process of HPD was successful, although small or large pathways may still be under-represented in HPD as a whole.
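The definition of pathway scale above can be illustrated with a minimal Python sketch. The toy pathways, the dictionary layout, and the function names `pathway_scale` and `scale_distribution` are assumptions made for illustration; they are not HPD data or HPD code.

```python
from collections import Counter

# Hypothetical toy pathways (not HPD content): each has a set of entities
# (nodes) and a list of events (edges between entities).
pathways = {
    "pathway_a": {"entities": {"P1", "P2", "C1"}, "events": [("P1", "P2"), ("P2", "C1")]},
    "pathway_b": {"entities": {"P1", "P3"}, "events": [("P1", "P3")]},
    "pathway_c": {"entities": {"P2", "P4", "P5"}, "events": [("P2", "P4"), ("P4", "P5")]},
}


def pathway_scale(pathway, by="entities"):
    """Return the scale of one pathway: its number of entities or of events."""
    return len(pathway[by])


def scale_distribution(pathways, by="entities"):
    """Count how many pathways have each scale value (a simple PSD)."""
    return Counter(pathway_scale(p, by) for p in pathways.values())


entity_psd = scale_distribution(pathways, by="entities")
event_psd = scale_distribution(pathways, by="events")
```

Comparing `entity_psd` with `event_psd` on real data would correspond, under these assumptions, to the comparison between the entity-based and event-based distributions described above.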
General online features
In Figure 2, we show the user interfaces of the Web-based online version of HPD. It supports both standard and customized search options that allow users to specify a list of genes/proteins or keywords as the query input. Upon executing a query, HPD retrieves a list of related human pathways in an HTML table, from which users can explore pathway details by clicking the hyperlinked pathway ID. The pathway detail HTML table that pops up lists the molecular entities, events, related pathways, and reference resources of the selected pathway. Users can also interact directly with advanced HPD features by selecting the pathway-protein association matrix applet (see Figure 3 for an example) and the pathway-pathway similarity matrix applet (see Figure 4 for an example). Comprehensive hyperlinks were built so that users can search for new pathways based on visual analysis performed in the applets. Pathway data retrieved by user queries can also be downloaded as flat files by academic users without restriction.
Case studies
We present three case studies of increasing complexity and biological significance to demonstrate how HPD can be used to solve real-world biological pathway problems.
Case study 1: searching for biological pathways and their components based on a single query protein
Using the standard query box provided at the HPD home page, we can search HPD for all biological pathways involving BRCA1_HUMAN (a major protein involved in breast cancer susceptibility). HPD returns a list of the top 20 BRCA1-related pathways, ordered by the decreasing number of proteins that each pathway shares with the other retrieved pathways. The better the rank of a retrieved pathway, the more related it should be to both the query protein BRCA1 and all BRCA1-relevant pathways. In this list, highly ranked pathways such as "Molecular Mechanisms of Cancer", "P53 Signaling", "DNA Repair Mechanism", and "BRCA1 pathway" are all well-characterized signaling pathways in breast cancer. All pathways are hyperlinked to their own detailed pathway information pages, which include molecular entities (proteins, complexes, and metabolites), related pathways, events, external pathway images, and reference articles (see Figure 2 for details).
The Web page with the list of pathways related to BRCA1 also contains links to download data. The pathway list, the pathway-protein association matrix, and the pathway-pathway similarity scores are downloadable as flat files.
Note that the pathway-protein association matrix contains proteins that are involved in the top 20 pathways retrieved by the single protein query, sorted in descending order of their maximal pathway involvement by activity count. BRCA1-related proteins are retrieved by pathway, with each protein covered by at least two of the 20 pathways. A close examination reveals that many breast cancer susceptibility genes from recent individual studies, including BRCA1, BRCA2, P53, PCNA [46], FOXA1 [47] and STK6 [48], and breast cancer biomarker genes such as ERBB2, FGFR2, M3K1, and PTEN [49, 50] are all found in this list.
Particularly noteworthy is the applet in the HPD Web page that shows all the query-related biological pathways and their involved proteins as a heat map. In Figure 3, BRCA1-related pathways and involved proteins are sorted and used as the two dimensions of the matrix. Mousing over a color-filled cell invokes an applet tooltip message, which shows the pathway and protein names.
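The pathway-protein association matrix described above can be sketched as a binary table. The following Python example is a simplified illustration: it keeps proteins found in at least two pathways and orders them by the number of pathways that contain them, which is an assumed stand-in for the activity-count ordering used by HPD. The pathway names and memberships are hypothetical.

```python
def association_matrix(pathway_members, min_pathways=2):
    """Build a binary pathway-by-protein matrix.

    Proteins are kept only if at least `min_pathways` pathways contain them,
    and are ordered by descending number of containing pathways (ties by name).
    """
    counts = {}
    for members in pathway_members.values():
        for protein in members:
            counts[protein] = counts.get(protein, 0) + 1
    proteins = sorted(
        (p for p, n in counts.items() if n >= min_pathways),
        key=lambda p: (-counts[p], p),
    )
    rows = {
        name: [1 if p in members else 0 for p in proteins]
        for name, members in pathway_members.items()
    }
    return proteins, rows


# Hypothetical memberships for illustration only (not HPD content).
toy = {
    "pathway_x": {"BRCA1_HUMAN", "P53_HUMAN", "PCNA_HUMAN"},
    "pathway_y": {"BRCA1_HUMAN", "P53_HUMAN"},
    "pathway_z": {"BRCA1_HUMAN", "PTEN_HUMAN"},
}
proteins, rows = association_matrix(toy)
```

Each row of `rows` corresponds to one heat map row, and each 1 to a color-filled cell.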
HPD users can also visualize the pathway-pathway similarity matrix (Figure 4), which shows the similarity scores among the BRCA1-related pathways. The pathway-pathway similarity matrix allows users to visualize clusters of similar pathway pairs as a 2-D interactive heat map. Users can right-click on any cell of this heat map (shown in Figure 4) to compare the corresponding pathway pair in the pathway-protein association matrix (future versions will include multiple pathway selection). This helps users identify the pathways most similar to BRCA1-related pathways.
Case study 2: developing pathway-pathway similarity networks from heterogeneous data sources
Using the advanced HPD search function online, a user can specify multiple proteins as the query input to obtain a list of the most relevant pathways related to the query protein set. For example, if the user enters "BRCA1_HUMAN, FOXA1_HUMAN, STK6_HUMAN" as query inputs, a significant number of pathways (Table 2) related to any of the query proteins will be returned. To ensure that retrieved pathways are relevant to the query proteins while avoiding overly restrictive filtering (e.g., requiring every retrieved pathway to contain all proteins in the input query would be too restrictive), we can use the concept of pathway similarity (see Methods section for details) and apply a minimal pathway similarity threshold $\{S_{i,j} \geq 0.2 \text{ and } |P_i \cap P_j| > 2\}$, $i = 1 \ldots N$, $j = 1 \ldots N$. The threshold requires a similarity score of at least 0.2 (roughly 20% shared molecular entities) together with more than 2 shared entities between two pathways. After applying this filter, 25 pathways and 39 pathway pairs are retrieved.
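As a minimal sketch of this filtering step, the Python function below keeps only pathway pairs that pass both conditions of the threshold as written in the inequality (score at least 0.2 and more than 2 shared entities). The function name `filter_pairs`, its parameters, and the toy scores are assumptions for illustration; both cutoffs are exposed as parameters so that other settings can be tried.

```python
def filter_pairs(scores, shared_counts, min_score=0.2, min_shared=2):
    """Keep pairs with score >= min_score and more than min_shared shared entities."""
    return sorted(
        pair
        for pair, score in scores.items()
        if score >= min_score and shared_counts[pair] > min_shared
    )


# Hypothetical similarity scores and shared-entity counts (not HPD results).
scores = {("a", "b"): 0.35, ("a", "c"): 0.10, ("b", "c"): 0.50}
shared = {("a", "b"): 4, ("a", "c"): 5, ("b", "c"): 2}
kept = filter_pairs(scores, shared)
```

In this toy input, one pair fails the score cutoff and another fails the shared-entity cutoff, so only a single pair remains in `kept`.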
In Figure 5, we show a visual display of the pathway-pathway similarity network, drawn with ProteoLens [51] from pathway similarity scores retrieved from HPD. The comprehensive perspective of breast cancer pathways seeded with the three initial query proteins draws on pathways from all five data sources. This observation strengthens the case for integrating pathways from heterogeneous sources. HPD pathways in this case study provide a good meta-model that connects our fragmented pathway knowledge in pathway-pathway similarity networks. This global perspective, supported by the integration of otherwise incompatible pathways from different sources, improves the chance of exposing novel insights in the search for disease drug targets and biomarkers.
In Figure 6, we show a comparison of using the "Multiple Protein Search" feature among three databases: HPD, KEGG, and Panther. Three gene names BRCA1, FOXA1, and AURKA were used to build a common query gene set. The KEGG Genes database was manually searched and only one KEGG Pathway was found, using the "Search object in Pathways" functionality of KEGG (actual corresponding KEGG gene ID entered: hsa:672, hsa:3169, and hsa:6790). Panther had a "Batch ID search" which accepted the three gene symbols and retrieved only four unique pathways. HPD not only retrieved more pathways (n = 25), but also supported multiple identifier types as inputs, e.g., UniProt names.
Case study 3: developing integrated pathway models from heterogeneous sources
While pathway-pathway similarity networks are useful for generating global perspectives on the relationships between pathways, the next case study demonstrates how to connect different types of biological pathways within HPD to form integrated pathway networks. Since pathway data managed at HPD is integrated at the schematic level, "deep integration" and "deep integrative analysis" are possible. We will use two breast cancer-related proteins, BRCA1_HUMAN and FOXA1_HUMAN, as an example. According to the HPD data model (See additional file 2 for details), the table Connect_mol_updated contains mappings among pathways, interactions, and molecules. To search for all related pathways containing the above two proteins within the HPD data warehouse, we can execute the following SQL query:
-- Follow up to two levels of molecule connections starting from BRCA1_HUMAN
SELECT pathway_name, mol_in, mol_in_updated, name_in, mol_out,
       mol_out_updated, name_out, interaction_type,
       SYS_CONNECT_BY_PATH(mol_in, '/') "Path"
FROM connect_mol_updated
START WITH name_in = 'BRCA1_HUMAN'
CONNECT BY NOCYCLE PRIOR mol_out_updated = mol_in_updated
       AND LEVEL < 3
-- INTERSECT keeps only the rows returned by both traversals
INTERSECT
-- Same traversal starting from FOXA1_HUMAN
SELECT pathway_name, mol_in, mol_in_updated, name_in, mol_out,
       mol_out_updated, name_out, interaction_type,
       SYS_CONNECT_BY_PATH(mol_in, '/') "Path"
FROM connect_mol_updated
START WITH name_in = 'FOXA1_HUMAN'
CONNECT BY NOCYCLE PRIOR mol_out_updated = mol_in_updated
       AND LEVEL < 3;
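For readers less familiar with Oracle hierarchical queries, the two-level traversal in the query above can be pictured with a short Python sketch. This is a loose illustration, not a translation of the SQL: it assumes that connections are available as (molecule in, molecule out, pathway name) tuples, it omits cycle handling because only two levels are followed, and it compares the two traversals by pathway name only. All molecule and pathway names other than the two seed proteins are hypothetical.

```python
from collections import defaultdict


def within_two_levels(edges, seed):
    """Return the edges reached from `seed` in at most two steps (mol_in -> mol_out)."""
    by_source = defaultdict(list)
    for edge in edges:
        by_source[edge[0]].append(edge)
    level1 = by_source[seed]                                     # LEVEL = 1
    level2 = [e for first in level1 for e in by_source[first[1]]]  # LEVEL = 2
    return set(level1) | set(level2)


# Hypothetical connections for illustration only (not rows from HPD).
edges = [
    ("BRCA1_HUMAN", "MOL_A", "pathway_1"),
    ("MOL_A", "MOL_B", "pathway_2"),
    ("FOXA1_HUMAN", "MOL_C", "pathway_3"),
    ("MOL_C", "MOL_B", "pathway_2"),
    ("MOL_B", "MOL_D", "pathway_4"),  # third level, never reached
]
brca1_edges = within_two_levels(edges, "BRCA1_HUMAN")
foxa1_edges = within_two_levels(edges, "FOXA1_HUMAN")
shared_pathways = {e[2] for e in brca1_edges} & {e[2] for e in foxa1_edges}
```

Here `shared_pathways` holds the pathways touched by both two-level neighborhoods, which is the kind of linkage the integrated model is meant to expose.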
We organize the results and present our final pathway analysis in Figure 7, which shows many relationships that cannot be found in the individual, fragmented biological pathways. The FOXA1 transcription factor network contains 9-cis-Retinoic acid, which regulates FOXA1 (Hepatocyte nuclear factor 3-alpha) [52]; it also contains BRCA1 (Breast cancer type 1 susceptibility protein) and CYP2C18 (Cytochrome P450 2C18), which is positively regulated by FOXA1 [53]. Arachidonic acid metabolism from KEGG PATHWAY involves arachidonic acid, which can be converted by CYP2C18 [54] to 14,15-epoxy-5,8,11-eicosatrienoic acid. This intermediate product can be further converted by EPHX2 (Epoxide hydrolase 2) to 14,15-dihydroxyeicosatrienoic acid [55]. In BioCarta, the ATM signaling pathway involves BRCA1, which positively regulates RAD51, which in turn regulates DNA repair [56]. The "Presynaptic phase of homologous DNA pairing and the strand exchange" pathway of Reactome contains BRCA2, which binds with RAD51 to form the RAD51:BRCA2 complex [57]. Human protein-protein interaction data could also be retrieved, showing that EPHX2 interacts with NSDHL (Sterol-4-alpha-carboxylate 3-dehydrogenase, decarboxylating). Phosphatidylinositol-3,4,5-trisphosphate (PIP3), a lipid molecule generated by the action of phosphoinositide-3-kinase (PI3K), can be induced by a variety of stimuli. PIP3 is thought to be the major physiological substrate of PTEN, a phosphatase that can dephosphorylate many phosphatidyl inositides and has been implicated in tumorigenesis [58]. Activation of protein kinase B (PKB)/Akt contributes to resistance to antiproliferative signals and breast cancer progression in part by impairing the nuclear import and action of p27 (CDKN1B) [59].
The integrated pathway model based on HPD pathways can be used as an investigative tool for disease diagnostic and therapeutic applications. For example, 9-cis-Retinoic acid is recognized as a possible breast cancer biomarker [60] and FOXA1 has gained increasing attention as a possible breast cancer therapeutic target [61]. The BRCA2-RAD51 interaction is essential for DNA repair and has also been suggested as a novel target for anti-breast cancer drugs [62]. Beyond breast cancer itself, links between breast cancer and other diseases can be studied. For example, increased risk of hereditary prostate cancer is known to be a result of polymorphism in the CDKN1B (p27) gene [63]. Epoxide hydrolase 2 has been characterized as a key mediator molecule in hypertensive, cardiovascular, inflammatory, pulmonary, and diabetic-related diseases [64–66]. CHILD syndrome, an X-linked dominant trait with lethality for male embryos, can also be traced to mutations in NSDHL, a gene playing crucial roles in the cholesterol biosynthetic pathway [67].
Through this case study, we have shown the significance of integrating pathway information of different types and from different data sources. The interconnected network analysis offers researchers a rare opportunity to gain global perspectives on events previously perceived in isolation. This "deep integrative analysis" cannot be readily performed by querying multiple online pathway databases separately. For example, the NCI-Nature Curated Pathway Interaction Database has a 'Connected Molecules' functionality, which can only be used to find molecular connections within the same pathway data source. In all, the convenience of building new integrative pathway models with the new HPD may greatly facilitate new drug development and biomarker discovery.
Conclusion
We developed HPD as an integrated pathway database system to manage, query, and analyze human biological pathways. HPD integrates all three types of biological pathways from five heterogeneous pathway database sources at syntactic, semantic, and schematic levels, primarily based on data warehousing techniques driven by a unified pathway data model. Pathway molecules, interactions, chemical reactions, and similar pathways can be searched, displayed, and downloaded from a unified online user interface. The current HPD software can help users address a wide range of pathway-related questions in human disease biology studies.
While the human Reactome is still far from complete, an integrative pathway database such as HPD can help researchers establish the global perspective necessary for understanding molecular mechanisms and developing biomedical applications. We will further expand the database to include pathways from HumanCyc [25], WikiPathways [35], NetPath [68], Panther [33] and TRANSFAC [20]. We also plan to integrate protein-protein interaction data from HAPPI [27] with the aim of discovering novel pathways when combined with HPD. Additional functions will also be provided, such as pathway reconstruction, in which users can select pathways and derive a reconstructed pathway expanded with protein-protein interaction data. With ongoing efforts, HPD can become a useful resource, linking proteins, genes, RNAs, signaling reactions, and gene regulatory events for systems biology applications.
Methods
Pathway data sources
We show an overview of the data integration process in Figure 8. Pathway data in HPD were collected or indexed from five different sources, i.e., NCI-Nature Curated data [29], BioCarta [31], Protein Lounge [34], Reactome [30] and KEGG [21]. The NCI-Nature Curated, Reactome and BioCarta data sets were all downloaded from the Nature Pathway Interaction Database Web site and are current as of the April 2009 release of the production HPD Web site. In particular, the NCI-Nature Curated pathways are curated by Nature Publishing Group editors based on known biomolecular interactions and key cellular processes of signaling/regulatory pathways. The Reactome database was downloaded in December 2007 (Release version 22). Pathway molecules from both NCI-Nature and Reactome were identified by their UniProt identifiers and annotated with post-translational modification information. Pathway molecules from BioCarta were identified by Entrez Gene IDs without post-translational modification annotations. In all three data sets, each pathway was represented as a series of events, each of which consists of molecules in one of the following four roles: input molecule, output molecule, agent, and inhibitor. Content from Protein Lounge was indexed by a Web crawler accessing a publicly available Web site. The crawled content was verified against the content provided under a site license to the authors and other users at Indiana University Simon Cancer Center. Since Protein Lounge is a commercial database that contains curated signaling, transduction, and metabolic pathways, we chose to index its content rather than integrate it in full into the data warehouse; only pathway-involved protein IDs and references to pathway diagram drawings were indexed. Original pathway molecules derived from Protein Lounge were identified by RefSeq ID or GI accession numbers. KEGG contains all known metabolic pathways and a small number of regulatory pathways and transport mechanisms.
The KEGG PATHWAY database contains graphical representations of pathways and lists of enzymes and reactions within the pathways. All specific pathway maps and overviews were manually drawn and contain links to additional information on pathway compounds, enzymes, and genes. Pathway molecules from KEGG are identified by EC numbers, which are mapped to KEGG Gene IDs and then to UniProt IDs. The total counts of initial pathways, proteins, compounds, protein complexes, and pathway interactions are shown in Figure 8.
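The event representation described above (a pathway as a series of events, each holding molecules in one of four roles) could be modeled along the following lines. This is a minimal Python sketch; the class names, field names, and the toy molecule identifiers are assumptions for illustration and do not reflect the actual HPD schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathwayEvent:
    """One pathway event, with molecules grouped by the four roles."""
    event_id: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    agents: List[str] = field(default_factory=list)
    inhibitors: List[str] = field(default_factory=list)

    def molecules(self):
        """Return every molecule taking part in the event, in any role."""
        return set(self.inputs) | set(self.outputs) | set(self.agents) | set(self.inhibitors)


@dataclass
class Pathway:
    """A pathway as a named series of events from one data source."""
    name: str
    source: str
    events: List[PathwayEvent] = field(default_factory=list)

    def molecules(self):
        """Return the union of molecules over all events in the pathway."""
        found = set()
        for event in self.events:
            found |= event.molecules()
        return found


# Hypothetical example: one reaction with an input, an output, and an agent.
event = PathwayEvent("e1", inputs=["MOL_A"], outputs=["MOL_B"], agents=["ENZ_1"])
pathway = Pathway("toy pathway", "toy source", [event])
```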
Pathway data integration
We developed a model-driven approach for syntactic, semantic, and schematic level integration of heterogeneous pathway data. Since pathway data were collected in a variety of formats, Python XML/HTML data parsers were developed to convert them into a common tab-delimited textual format to ensure syntactic-level data compatibility. The semantic compatibility of the data was enforced by cleaning up data attributes and data values to keep them consistent, using a standard data extraction, transformation, and loading (ETL) process characteristic of data warehousing-based data integration approaches. All pre-processed data were parsed, cleaned, and loaded into data warehouse staging tables before reaching their final database table destinations. To maintain schematic data compatibility, we modeled relationships among different pathway concepts using an entity-relationship (ER) data model (for more details on the data model, please refer to the documentation on the HPD Web site and Additional file 2). We further mapped all the involved proteins or genes to their UniProt Name Identifiers [69] and metabolites to their KEGG compound IDs before loading the HPD pathway data into data warehouse tables defined by the ER data model. All HPD molecular entities, events, and pathways were assigned unique HPD-specific identifiers.
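The parse, map, and export steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HPD parsers: the XML layout, the tag and attribute names, the tiny lookup table `ID_TO_UNIPROT_NAME`, and the function names are all hypothetical, and a real pipeline would load identifier mappings from a reference resource rather than a hard-coded dictionary.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source document; real source schemas differ.
SAMPLE_XML = """<pathways>
  <pathway name="Example pathway">
    <molecule id="672" type="entrez"/>
    <molecule id="P38398" type="uniprot_ac"/>
    <molecule id="9999999" type="entrez"/>
  </pathway>
</pathways>"""

# Hypothetical stand-in for an identifier mapping table.
ID_TO_UNIPROT_NAME = {
    ("entrez", "672"): "BRCA1_HUMAN",
    ("uniprot_ac", "P38398"): "BRCA1_HUMAN",
}


def extract_rows(xml_text, id_map):
    """Parse pathway XML; return (mapped rows, identifiers with no mapping)."""
    rows, unmapped = [], []
    root = ET.fromstring(xml_text)
    for pathway in root.iter("pathway"):
        for mol in pathway.iter("molecule"):
            key = (mol.get("type"), mol.get("id"))
            name = id_map.get(key)
            if name is None:
                unmapped.append(key)  # set aside for manual cleaning
            else:
                rows.append((pathway.get("name"), key[0], key[1], name))
    return rows, unmapped


def to_tsv(rows):
    """Serialize rows to the common tab-delimited text format."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["pathway", "id_type", "source_id", "uniprot_name"])
    writer.writerows(rows)
    return buffer.getvalue()


rows, unmapped = extract_rows(SAMPLE_XML, ID_TO_UNIPROT_NAME)
tsv_text = to_tsv(rows)
```

Keeping unmapped identifiers in a separate list, rather than dropping them silently, is one simple way to support the data cleaning step before staging tables are loaded.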
Online HPD software design
The HPD database was developed as a data warehouse application. The online version of HPD is a standard 3-tier Web application, which consists of an Oracle 10g database at the backend database server layer, Apache/PHP server scripts at the middleware application Web server layer, and CSS-driven Web pages presented at the browser.
Pathway similarity measure
The pathway similarity measure can be defined as the extent of overlap, e.g., the number of common genes/proteins, shared between two different pathways. We define a pathway-pathway similarity score $S_{i,j}$ based on equation (2) [70]. Both overlap and similarity score values can be downloaded from the HPD Web site.
Here, $N$ denotes the total number of pathways. $P_i$ and $P_j$ denote two different pathways, while $|P_i|$ and $|P_j|$ are the numbers of molecules in these two pathways, respectively, that can be mapped to a UniProt ID. Their intersection $P_i \cap P_j$ denotes the common set of molecules that can be mapped to the same UniProt ID, while the size of their union $P_i \cup P_j$ is calculated as $|P_i| + |P_j| - |P_i \cap P_j|$. Here $\alpha$ is a weight coefficient in $[0, 1]$, and we currently use $\alpha = 0.8$ to weight the contributions of the overlap term (left item, $S_L$) and the cover term (right item, $S_R$).
We can also make special provisions for subnetwork relationships (as defined by the Nature Pathway Interaction Database at http://pid.nci.nih.gov/). For a subnetwork relationship, we define $S_{i,j} = 1.01$ if pathway $P_i$ contains $P_j$ as a subnetwork, and $S_{i,j} = -1.01$ if pathway $P_i$ is a subnetwork of $P_j$.
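The subnetwork rule can be sketched as a small override applied on top of a computed score. The function and parameter names below are hypothetical, and the sketch assumes that the subnetwork flags are taken from curated source annotations rather than inferred from molecule overlap.

```python
SUPER_NETWORK_SCORE = 1.01   # P_i contains P_j as a subnetwork
SUB_NETWORK_SCORE = -1.01    # P_i is itself a subnetwork of P_j


def apply_subnetwork_rule(score, i_contains_j=False, i_within_j=False):
    """Return the special subnetwork score when a curated relationship
    exists between P_i and P_j; otherwise return the computed score."""
    if i_contains_j and i_within_j:
        raise ValueError("a pathway pair cannot be nested in both directions")
    if i_contains_j:
        return SUPER_NETWORK_SCORE
    if i_within_j:
        return SUB_NETWORK_SCORE
    return score


if __name__ == "__main__":
    print(apply_subnetwork_rule(0.61))                     # ordinary pair
    print(apply_subnetwork_rule(0.61, i_contains_j=True))  # P_i contains P_j
    print(apply_subnetwork_rule(0.61, i_within_j=True))    # P_i inside P_j
```

Note that the resulting score is directional: swapping the two pathways flips the sign of the special value, so a matrix of these scores is not symmetric for subnetwork pairs.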
References
Cary MP, Bader GD, Sander C: Pathway information for systems biology. FEBS Lett 2005, 579(8):1815–1820. 10.1016/j.febslet.2005.02.005
Logan CY, Nusse R: The Wnt signaling pathway in development and disease. Annu Rev Cell Dev Biol 2004, 20: 781–810. 10.1146/annurev.cellbio.20.010403.113126
Werner T: Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol 2008, 19(1):50–54. 10.1016/j.copbio.2007.11.005
Shen R, Chinnaiyan AM, Ghosh D: Pathway analysis reveals functional convergence of gene expression profiles in breast cancer. BMC Med Genomics 2008, 1: 28. 10.1186/1755-8794-1-28
Frasor J, Danes JM, Komm B, Chang KCN, Lyttle CR, Katzenellenbogen BS: Profiling of estrogen up- and down-regulated gene expression in human breast cancer cells: Insights into gene networks and pathways underlying estrogenic control of proliferation and cell phenotype. Endocrinology 2003, 144(10):4562–4574. 10.1210/en.2003-0567
Chittenden TW, Howe EA, Culhane AC, Sultana R, Taylor JM, Holmes C, Quackenbush J: Functional classification analysis of somatically mutated genes in human breast and colorectal cancers. Genomics 2008, 91(6):508–511. 10.1016/j.ygeno.2008.03.002
Chen GJ, Weylie B, Hu C, Zhu J, Forough R: FGFR1/PI3K/AKT signaling pathway is a novel target for antiangiogenic effects of the cancer drug fumagillin (TNP-470). J Cell Biochem 2007, 101(6):1492–1504. 10.1002/jcb.21265
Cheng JQ, Lindsley CW, Cheng GZ, Yang H, Nicosia SV: The Akt/PKB pathway: molecular target for cancer drug discovery. Oncogene 2005, 24(50):7482–7492. 10.1038/sj.onc.1209088
Mazzone M, Comoglio PM: The Met pathway: master switch and drug target in cancer progression. FASEB J 2006, 20(10):1611–1621. 10.1096/fj.06-5947rev
Takahashi-Yanaga F, Sasaguri T: The Wnt/beta-catenin signaling pathway as a target in drug discovery. J Pharmacol Sci 2007, 104(4):293–302. 10.1254/jphs.CR0070024
Schreiber SL: Target-oriented and diversity-oriented organic synthesis in drug discovery. Science 2000, 287(5460):1964–1969. 10.1126/science.287.5460.1964
Xu EY, Schaefer WH, Xu QW: Metabolomics in pharmaceutical research and development: Metabolites, mechanisms and pathways. Current Opinion in Drug Discovery & Development 2009, 12(1):40–52.
Fujita N, Tsuruo T: Survival-signaling pathway as a promising target for cancer chemotherapy. Cancer chemotherapy and pharmacology 2003, 52(Suppl 1):S24–28. 10.1007/s00280-003-0591-2
Garman KS, Nevins JR, Potti A: Genomic strategies for personalized cancer therapy. Hum Mol Genet 2007, 16(Spec No 2):R226–232. 10.1093/hmg/ddm184
Sander C: Genomic medicine and the future of health care. Science 2000, 287(5460):1977–1978. 10.1126/science.287.5460.1977
Tateishi HS Naoko, Kuhara Satoru, Takagi Toshihisa, Kanehisa Minoru: An integrated database SPAD (Signaling PAthway Database) for signal transduction and genetic information. Genome Informatics 1995, 6: 160–161.
CST – Cell Signaling Technology Pathway Database [http://www.cellsignal.com/]
STKE – Signal Transduction Knowledge Environment [http://www.stke.org/]
COPE – Cytokines and Cells Online Pathfinder Encyclopedia [http://www.copewithcytokines.de/]
Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research 2000, 28(1):316–319. 10.1093/nar/28.1.316
Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000, 28(1):27–30. 10.1093/nar/28.1.27
Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28(1):123–125. 10.1093/nar/28.1.123
ExPASy – Biochemical Pathways [http://www.expasy.ch/cgi-bin/search-biochem-index]
Ellis LBM, Hershberger CD, Wackett LP: The University of Minnesota Biocatalysis/Biodegradation Database: microorganisms, genomics and prediction. Nucleic Acids Research 2000, 28(1):377–379. 10.1093/nar/28.1.377
Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD: Computational prediction of human metabolic pathways from the complete human genome. Genome Biology 2005, 6(1):R2. 10.1186/gb-2004-6-1-r2
Peri S, Navarro JD, Amanchy R, Kristiansen T, Jonnalagadda J, Vineeth S, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, et al.: Human Protein Reference Database: Building a biological platform for systems biology. American Journal of Human Genetics 2003, 73(5):429–429.
Chen JYS, et al.: HAPPI: an Online Database of Comprehensive Human Annotated and Predicted Protein Interactions. BMC Genomics 2009, 10(Suppl 1):S16. 10.1186/1471-2164-10-S1-S16
von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 2005, (33 Database):D433–437.
Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Research 2009, 37: D674-D679. 10.1093/nar/gkn653
Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, et al.: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Research 2009, 37: D619-D622. 10.1093/nar/gkn863
BioCarta [http://www.biocarta.com/index.asp]
Pathway Commons [http://www.pathwaycommons.org/pc/home.do]
Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 2003, 13(9):2129–2141. 10.1101/gr.772403
Protein Lounge [http://www.proteinlounge.com/]
Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C: WikiPathways: Pathway editing for the people. PLoS Biology 2008, 6(7):1403–1407. 10.1371/journal.pbio.0060184
Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD: Computational prediction of human metabolic pathways from the complete human genome. Genome Biol 2005, 6(1):R2. 10.1186/gb-2004-6-1-r2
Darvish A, Najarian K: Prediction of regulatory pathways using mRNA expression and protein interaction data: application to identification of galactose regulatory pathway. Biosystems 2006, 83(2–3):125–135. 10.1016/j.biosystems.2005.06.013
Romero PR, Karp PD: Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics 2004, 20(5):709–717. 10.1093/bioinformatics/btg471
Frohlich H, Fellmann M, Sultmann H, Poustka A, Beissbarth T: Predicting pathway membership via domain signatures. Bioinformatics 2008, 24(19):2137–2142. 10.1093/bioinformatics/btn403
Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al.: IntAct: an open source molecular interaction database. Nucleic Acids Research 2004, 32: D452-D455. 10.1093/nar/gkh052
Luciano JS: PAX of mind for pathway researchers. Drug Discovery Today 2005, 10(13):937–942. 10.1016/S1359-6446(05)03501-4
van Iersel MP, Kelder T, Pico AR, Hanspers K, Coort S, Conklin BR, Evelo C: Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics 2008, 9: 399. 10.1186/1471-2105-9-399
Cerami EG, Bader GD, Gross BE, Sander C: cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioinformatics 2006, 7: 497. 10.1186/1471-2105-7-497
Schoof H, Ernst R, Nazarov V, Pfeifer L, Mewes HW, Mayer KF: MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res 2004, (32 Database):D373–376. 10.1093/nar/gkh068
Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji W, McLaren P, North P, et al.: FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol 2007, 8(7):R129. 10.1186/gb-2007-8-7-r129
Balmain A, Gray J, Ponder B: The genetics and genomics of cancer. Nature Genetics 2003, 33(3 s):238–244. 10.1038/ng1107
Nakshatri H, Badve S: FOXA1 in breast cancer. Expert reviews in molecular medicine 2009, 11: e8. 10.1017/S1462399409001008
Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B, Desmedt C, Ignatiadis M, Sengstag T, Schütz F, et al.: Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research: BCR 2008, 10(4):R65. 10.1186/bcr2124
Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R: Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007, 447(7148):1087–1095. 10.1038/nature05887
Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod S, Olshen AB, Gregersen P: Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A 2008, 105(11):4340–4345. 10.1073/pnas.0800441105
Huan T, Sivachenko A, Harrison S, Chen JY: ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining. BMC Bioinformatics 2008, 9(Suppl 9):S5. 10.1186/1471-2105-9-S9-S5
Shimizu S, Kondo M, Miyamoto Y, Hayashi M: Foxa (HNF3) up-regulates vitronectin expression during retinoic acid-induced differentiation in mouse neuroblastoma Neuro2a cells. Cell Struct Funct 2002, 27(4):181–188. 10.1247/csf.27.181
Williamson EA, Wolf I, O'Kelly J, Bose S, Tanosaki S, Koeffler HP: BRCA1 and FOXA1 proteins coregulate the expression of the cell cycle-dependent kinase inhibitor p27(Kip1). Oncogene 2006, 25(9):1391–1399. 10.1038/sj.onc.1209170
Sacerdoti D, Gatta A, McGiff JC: Role of cytochrome P450-dependent arachidonic acid metabolites in liver physiology and pathophysiology. Prostaglandins & Other Lipid Mediators 2003, 72(1–2):51–71. 10.1016/S1098-8823(03)00077-7
Spector AA, Fang X, Snyder GD, Weintraub NL: Epoxyeicosatrienoic acids (EETs): metabolism and biochemical function. Progress in Lipid Research 2004, 43(1):55–90. 10.1016/S0163-7827(03)00049-3
Cousineau I, Abaji C, Belmaaza A: BRCA1 regulates RAD51 function in response to DNA damage and suppresses spontaneous sister chromatid replication slippage: Implications for sister chromatid cohesion, genome stability, and carcinogenesis. Cancer Research 2005, 65(24):11384–11391. 10.1158/0008-5472.CAN-05-2156
Tarsounas M, Davies D, West SC: BRCA2-dependent and independent formation of RAD51 nuclear foci. Oncogene 2003, 22(8):1115–1123. 10.1038/sj.onc.1206263
Ignatoski KMW, Livant DL, Markwart S, Grewal NK, Ethier SP: The role of phosphatidylinositol 3'-kinase and its downstream signals in erbB-2-mediated transformation. Molecular Cancer Research 2003, 1(7):551–560.
Liang J, Zubovitz J, Petrocelli T, Kotchetkov R, Connor MK, Han K, Lee JH, Ciarallo S, Catzavelos C, Beniston R, et al.: PKB/Akt phosphorylates p27, impairs nuclear import of p27 and opposes p27-mediated G1 arrest. Nature Medicine 2002, 8(10):1153–1160. 10.1038/nm761
Rubin M, Fenig E, Rosenauer A, Menendezbotet C, Achkar C, Bentel JM, Yahalom J, Mendelsohn J, Miller WH: 9-Cis Retinoic Acid Inhibits Growth of Breast-Cancer Cells and down-Regulates Estrogen-Receptor Rna and Protein. Cancer Research 1994, 54(24):6549–6556.
Nakshatri H, Badve S: FOXA1 as a therapeutic target for breast cancer. Expert Opinion on Therapeutic Targets 2007, 11(4):507–514. 10.1517/14728222.11.4.507
Ziogas D, Liakakos T, Lykoudis E, Fatourou E, Roukos DH: Exploring the role of BRCA1, BRCA2 and RAD51 as biomarkers for breast cancer. Radiother Oncol 2009, 90(1):161–162. 10.1016/j.radonc.2008.02.020
Chang BL, Zheng SL, Isaacs SD, Wiley KE, Turner A, Li G, Walsh PC, Meyers DA, Isaacs WB, Xu J: A polymorphism in the CDKN1B gene is associated with increased risk of hereditary prostate cancer. Cancer Res 2004, 64(6):1997–1999. 10.1158/0008-5472.CAN-03-2340
Inceoglu B, Schmelzer KR, Morisseau C, Jinks SL, Hammock BD: Soluble epoxide hydrolase inhibition reveals novel biological functions of epoxyeicosatrienoic acids (EETs). Prostaglandins & Other Lipid Mediators 2007, 82(1–4):42–49. 10.1016/j.prostaglandins.2006.05.004
Sinal CJ, Miyata M, Tohkin M, Nagata K, Bend JR, Gonzalez FJ: Targeted disruption of soluble epoxide hydrolase reveals a role in blood pressure regulation. Journal of Biological Chemistry 2000, 275(51):40504–40510. 10.1074/jbc.M008106200
Yu ZG, Xu FY, Huse LM, Morisseau C, Draper AJ, Newman JW, Parker C, Graham L, Engler MM, Hammock BD, et al.: Soluble epoxide hydrolase regulates hydrolysis of vasoactive epoxyeicosatrienoic acids. Circulation Research 2000, 87(11):992–998.
Bittar M, Happle R, Grzeschik KH, Leveleki L, Hertl M, Bornholdt D, Konig A: CHILD syndrome in 3 generations – The importance of mild or minimal skin lesions. Archives of Dermatology 2006, 142(3):348–351. 10.1001/archderm.142.3.348
NetPath – Signal Transduction Pathways [http://www.netpath.org/]
Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, (34 Database):D187–191. 10.1093/nar/gkj161
Wu X, Chowbina SR, Li PM, Pandey R, Kasamsetty HN, Chen JY: Characterizing Mergeability of Human Molecular Pathways. In press.
Acknowledgements
The HPD database was developed with research funding from the Department of Defense (DOD) Breast Cancer Research Program (BCRP) Concept Award (W81XWH-08-1-0623) to Dr. Jake Chen. We thank Stephanie Burks and Joseph Rinkovsky from the University Information Technology Services (UITS) at Indiana University for providing generous support in Oracle 10g database administration and in configuring the Web server for the project. We especially thank David Michael Grobe from UITS at Indiana University for thoroughly proofreading the manuscript and providing helpful comments for this project.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JYC conceived the initial work, designed the method for the database construction, and drafted the manuscript. SRC and HNK implemented the design and developed the database from integrated data sets. SRC, FZ and PML implemented the Web-based database interface. XW performed HPD pathway analysis to generate the first two case studies and the statistical analysis. RP and SRC together implemented data warehousing strategies, provided ID mapping tables, and performed data processing, extraction, transformation, and loading. All authors were involved in the revisions of the manuscript.
Sudhir R Chowbina, Xiaogang Wu and Jake Y Chen contributed equally to this work.
Electronic supplementary material
12859_2009_3387_MOESM1_ESM.doc
Additional file 1: This additional file lists the top 100 pathways ranked by degree (the number of neighbour pathways with which the similarity score is > 0), the top 100 genes/proteins ranked by frequency, and the top 100 compounds ranked by frequency. Here, the frequency of a molecular entity (i.e., a gene/protein or compound) also includes multiple appearances in the same pathways. (DOC 66 KB)
12859_2009_3387_MOESM2_ESM.xls
Additional file 2: This additional file describes the pathway entity-relationship (ER) data model for HPD pathway integrations in detail. (XLS 44 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Chowbina, S.R., Wu, X., Zhang, F. et al. HPD: an online integrated human pathway database enabling systems biology studies. BMC Bioinformatics 10 (Suppl 11), S5 (2009). https://doi.org/10.1186/1471-2105-10-S11-S5
DOI: https://doi.org/10.1186/1471-2105-10-S11-S5