Pathways can be used to analyze various types of clinical data, including whole exome sequencing (WES), transcriptomics and proteomics, genome-wide association studies (GWAS), metabolomics, and targeted chemical assay data. In this section, we provide several examples of how the created pathway models can be used for data analysis. Scripts and instructions for performing these analyses can be found online (bigcat-um.github.io/IEMPathwayAnalysis).
Pathway Analysis
The availability of pathways as machine-readable models enables us to visualize molecular data on the elements in the pathways and to perform advanced computational analysis including gene set enrichment and network analysis. While whole-genome sequencing is already well established in clinical practice, transcriptome analysis seems promising for diagnosing rare disease patients (Gonorazky et al. 2019).
Pathway analysis is a powerful tool for placing transcriptomics data in a biological context. As an example, we selected a publicly available transcriptomic dataset of patients with Lesch-Nyhan disease (LND) (geo:GSE24345, omim:300322) (Kang et al. 2011). The disease is caused by mutations in the HPRT1 gene (ensembl:ENSG00000165704), which encodes the enzyme hypoxanthine phosphoribosyltransferase 1 (uniprot:P00492); this enzyme enables cells to recycle purines. After statistical analysis with GEO2R (Davis and Meltzer 2007), the differential gene expression data was visualized on the previously described purine pathway (Fig. 73.3) using the WikiPathways app in Cytoscape (apps.cytoscape.org/apps/wikipathways) (Kutmon et al. 2014b); see Fig. 73.4. Cytoscape (www.cytoscape.org) (Shannon et al. 2003) is a widely adopted network analysis and visualization software tool, and the WikiPathways app provides direct access to the WikiPathways pathway content. The log2 fold changes (the difference in gene expression between the patient and control groups) are visualized as a gradient from blue (lower expression) through white (unchanged) to red (higher expression). A strong downregulation of HPRT1 is immediately visible from the corresponding blue gene boxes. In this dataset, several other enzymes in the pathway (PNP, APRT, PRPS1) are upregulated in LND patients, which may reflect an attempt to compensate for the lower transcript availability of HPRT1. Other enzymes, upstream or downstream of the affected purine pathway, could also show changes relevant to the phenotype of the patient. Enrichment analysis on other pathway models can therefore be used to assess the relevance of these pathways and to find affected processes of interest in a given dataset. Here, enrichment analysis showed that several immune-related processes, including the complement system, as well as the Wnt signaling pathway are affected in LND patients. Reimand et al. published a protocol that describes pathway enrichment analysis and provides a practical step-by-step guide (Reimand et al. 2019).
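To make the idea of enrichment analysis concrete, the following minimal Python sketch implements a simple over-representation test based on the hypergeometric distribution. It is an illustration only and not necessarily the method used in the scripts referenced in this chapter; the function names, the toy gene identifiers, and the toy pathways are all hypothetical, and a real analysis would additionally correct the p-values for multiple testing.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_changed, n_overlap):
    """Probability of seeing at least n_overlap pathway genes among
    n_changed genes drawn without replacement from n_universe genes,
    of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_changed)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_changed - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_changed)


def enrich(pathways, changed, universe):
    """Rank pathways (name -> set of genes) by over-representation p-value."""
    changed = changed & universe
    rows = []
    for name, genes in pathways.items():
        genes = genes & universe
        overlap = len(genes & changed)
        p = hypergeom_pvalue(len(universe), len(genes), len(changed), overlap)
        rows.append((name, overlap, len(genes), p))
    # Smallest p-value (strongest over-representation) first.
    return sorted(rows, key=lambda row: row[3])


# Toy input with placeholder gene names (not taken from GSE24345).
universe = {f"g{i}" for i in range(20)}
pathways = {
    "pathway A (toy)": {"g0", "g1", "g2", "g3"},
    "pathway B (toy)": {"g10", "g11", "g12", "g13", "g14"},
}
changed = {"g0", "g1", "g2", "g15"}

for name, overlap, size, p in enrich(pathways, changed, universe):
    print(f"{name}: {overlap}/{size} changed genes, p = {p:.4f}")
```

In this sketch, the universe should contain all genes measured in the experiment, since the choice of background strongly influences the resulting p-values.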
Tutorial videos and R scripts for the data visualization (1A, Fig. 73.4) and pathway enrichment analysis (1B) can be found in the PathwayAnalysis folder on GitHub (bigcat-um.github.io/IEMPathwayAnalysis).
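The blue-white-red gradient used for the data visualization can be sketched in a few lines of Python. This is not the Cytoscape implementation; the function name and the saturation limit of ±1 are assumptions chosen for illustration, and the input values in the usage loop are arbitrary.

```python
def log2fc_to_hex(log2fc, limit=1.0):
    """Map a log2 fold change to a hex color on a blue-white-red gradient.

    Values at or beyond -limit are fully blue, 0 is white, and values at
    or beyond +limit are fully red. The limit is a hypothetical choice.
    """
    x = max(-limit, min(limit, log2fc)) / limit  # clamp and scale to [-1, 1]
    fade = round(255 * (1 - abs(x)))  # 255 at no change, 0 at the limit
    if x < 0:
        r, g, b = fade, fade, 255  # lower expression: toward blue
    else:
        r, g, b = 255, fade, fade  # higher expression: toward red
    return f"#{r:02x}{g:02x}{b:02x}"


# Arbitrary illustrative values, not measured data.
for value in (-2.0, -1.0, 0.0, 1.0):
    print(value, log2fc_to_hex(value))
```

Clamping at a fixed limit keeps a few extreme values from washing out the color differences among moderately changed genes.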
Pathways as a Source for Network Biology
Well-annotated pathway models are also an immensely useful source for network biology approaches. They can be used to extend biological pathways and networks (combinations of pathways) with additional knowledge, such as drug targets.
Using the WikiPathways app in Cytoscape, pathways can be visualized in two different views: the pathway view and the network view. The network view retains only the biological interactions in the pathway and applies an automatic layout, allowing the pathways to be used for network analysis. Pathways can then be automatically extended with additional information, e.g., regulatory mechanisms, known drugs, or disease annotations. The CyTargetLinker app for Cytoscape was specifically designed for this purpose (cytargetlinker.github.io) (Kutmon et al. 2019). In the following example, we extend the purine metabolism pathway with approved drugs from DrugBank; see Fig. 73.5. The pathway contains 15 enzymes, shown as rounded rectangles. In DrugBank, 21 drugs (shown as green diamonds) are reported to target one or more of these enzymes. Nine of the 15 enzymes are targeted by at least one drug, as shown in orange. The LND-causing gene HPRT1 (highlighted with the red rectangle) is targeted by three drugs: azathioprine, mercaptopurine, and α-phosphoribosyl pyrophosphoric acid. The first two are immunosuppressants and inhibit HPRT1. The last molecule is a key substance in the biosynthesis of histidine, tryptophan, purine, and pyrimidine nucleotides and has unknown pharmacological action on HPRT1.
Tutorial videos and R scripts for this example can be found in the Network Analysis folder on GitHub (bigcat-um.github.io/IEMPathwayAnalysis).
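The principle behind extending a pathway network with drug-target links can be sketched as follows. This is a simplified illustration and not the CyTargetLinker API: the data structures and function names are hypothetical, only four of the pathway's enzymes are listed, the pathway's own interactions are omitted, and only the three HPRT1 links are taken from the text (the fourth link is a made-up placeholder showing that links to targets outside the network are ignored).

```python
from collections import Counter

# Subset of the enzymes in the purine pathway mentioned in this section.
pathway_nodes = {"HPRT1": "enzyme", "PNP": "enzyme", "APRT": "enzyme", "PRPS1": "enzyme"}
pathway_edges = []  # pathway interactions omitted in this sketch

# Drug-target link set as (drug, target) pairs.
linkset = [
    ("azathioprine", "HPRT1"),
    ("mercaptopurine", "HPRT1"),
    ("alpha-phosphoribosyl pyrophosphoric acid", "HPRT1"),
    ("placeholder drug (hypothetical)", "GENE_OUTSIDE_PATHWAY"),
]


def extend_network(nodes, edges, links):
    """Return copies of nodes and edges with drugs added for known targets."""
    new_nodes = dict(nodes)
    new_edges = list(edges)
    for drug, target in links:
        if target in nodes:  # only extend from nodes already in the network
            new_nodes[drug] = "drug"
            new_edges.append((drug, target, "targets"))
    return new_nodes, new_edges


def drugs_per_target(edges):
    """Count the number of drugs linked to each target."""
    return Counter(target for _, target, kind in edges if kind == "targets")


nodes, edges = extend_network(pathway_nodes, pathway_edges, linkset)
for target, count in drugs_per_target(edges).items():
    print(f"{target} is targeted by {count} drug(s)")
```

Keeping the link set separate from the pathway makes it straightforward to swap in other types of knowledge, such as regulatory interactions or disease annotations, without changing the extension logic.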
Linking Chemical (Biomarker) Data with RDF
The two examples in the previous sections work well for transcriptomics data. However, connecting metabolomics or chemical biomarker data to pathways can be challenging with the methods presented above, since such datasets contain far fewer measurements. Furthermore, as previously mentioned in Sect. 73.3.2, various levels of chemical detail exist within pathway models. To overcome these problems, we highlight another method that connects biomarker data with pathways using an automated approach. The workflow for this example is visualized in Fig. 73.6a. One starts with chemical data from targeted (clinically validated) assays or with (un)targeted metabolomics data from mass spectrometry (MS) and/or nuclear magnetic resonance (NMR). After preprocessing of this data, for example, in R (Stanstrup et al. 2019), and annotation of relevant peaks with the corresponding chemical structure, one can also add an identifier from a (supported) database to the data. For subsequent data analysis steps, we used BridgeDb to link these identifiers to their corresponding InChIKey (a hashed, fixed-length version of the InChI) (Heller et al. 2015), which links the chemical structure to the original compound identifier in a machine-readable manner. The InChIKey consists of three parts separated by hyphens ("-"): the first describes the general structure of a molecule, the second its stereochemistry, and the third the charge of the molecule. For example, "CKLJMWTZIZZHCS-REOHCLBHSA-M" is the InChIKey for L-aspartic acid monoanion, where the "M" stands for a −1 charge. To link these compounds to (metabolic) pathways, we use a SPARQL query (Galgonek et al. 2016). Such a query is a structured way to ask questions of a database such as WikiPathways, which has been converted to the Resource Description Framework (RDF) format (Waagmeester et al. 2016). 
This RDF format unifies and harmonizes pathway data over all models, with several filtering and search options; data can be queried for specific species, DataNodes, literature, and more. Figure 73.6b provides an example of a SPARQL query where we investigated the occurrence of five compounds within the pathway models of WikiPathways. More details on the structure of the WikiPathways RDF and example queries are available on the Semantic Web portal: rdf.wikipathways.org. We also provide a beginners’ tutorial on how to write SPARQL queries in the SPARQL folder on GitHub (bigcat-um.github.io/IEMPathwayAnalysis).
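The three-part structure of the InChIKey described above can be made explicit with a small parser. This is a sketch under stated assumptions: the function and field names are hypothetical, the block lengths (14, 10, and 1 characters) follow the standard InChIKey layout, and the mapping of the final letter to a charge assumes the convention in which "N" is neutral and each letter before or after it in the alphabet shifts the charge by one; only "M" (−1) and "N" (neutral) are confirmed by the text.

```python
def parse_inchikey(key):
    """Split an InChIKey into its three hyphen-separated parts."""
    parts = key.split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a valid InChIKey layout: {key!r}")
    first_block, second_block, flag = parts
    return {
        "structure": first_block,        # general structure of the molecule
        "stereochemistry": second_block,  # stereochemistry block
        "flag": flag,                     # charge indicator letter
        # Assumed convention: 'N' is neutral, 'M' is -1, 'O' would be +1.
        "charge": ord(flag) - ord("N"),
    }


# InChIKey of L-aspartic acid monoanion, as given in the text.
info = parse_inchikey("CKLJMWTZIZZHCS-REOHCLBHSA-M")
print(info["structure"], info["stereochemistry"], info["charge"])
```

Comparing only the first two parts of two keys is a convenient way to check whether they describe the same compound in different charge states.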
To illustrate the concepts of SPARQL and data mapping, we chose the five naturally charged proteinogenic amino acids (negative charge: aspartic acid (Asp, D) and glutamic acid (Glu, E); positive charge: arginine (Arg, R), histidine (His, H), and lysine (Lys, K)). Lines 6–10 in Fig. 73.6b provide the InChIKeys for these compounds, which we link to pathway data in line 15 (wp:bdbInChIKey ?inchikey). The results of the example query reveal in which pathways these compounds can be found; the pathways are sorted from highest to lowest (line 22) by occurrence count (line 4), so that the most relevant pathways for further analysis appear first. This results in 15 pathways, each of which contains one of these compounds (RDF data release 2021-03-10). However, since pathways can also be annotated with the uncharged form of these amino acids, we can query all compounds with the same stereochemistry independent of charge status to obtain a more complete result. By changing the end of each compound's InChIKey to its neutral counterpart ("CKLJMWTZIZZHCS-REOHCLBHSA-M" becomes "CKLJMWTZIZZHCS-REOHCLBHSA-N") and changing line 15 to "rdfs:seeAlso ?inchikey", we obtain a total of 38 pathways, with 15 pathways containing 2 of the 5 amino acids (either in their neutral or charged form). This example is explained in more detail in the SPARQL folder on GitHub (bigcat-um.github.io/IEMPathwayAnalysis).
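The two query variants described above can be sketched as a small Python helper that assembles the SPARQL text. This is not the query shown in Fig. 73.6b, and its line numbers do not correspond to those cited in the text. Only the two patterns "wp:bdbInChIKey ?inchikey" and "rdfs:seeAlso ?inchikey" and the aspartate InChIKey come from this chapter; the prefix IRIs, the class and property names wp:Metabolite, wp:Pathway, dcterms:isPartOf, and dc:title, and the identifiers.org form of the InChIKey IRIs are assumptions that should be checked against rdf.wikipathways.org before use.

```python
# Assumed prefix declarations; verify against the WikiPathways RDF documentation.
PREFIXES = """\
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""


def neutral_form(inchikey):
    """Replace the final charge letter of an InChIKey with 'N' (neutral)."""
    return inchikey[:-1] + "N"


def build_query(inchikeys, predicate="wp:bdbInChIKey"):
    """Build a query that counts how many of the given compounds occur per pathway."""
    # Assumed IRI form for InChIKeys in the RDF.
    values = " ".join(f"<https://identifiers.org/inchikey/{k}>" for k in inchikeys)
    return f"""{PREFIXES}
SELECT ?pathway ?title (COUNT(DISTINCT ?inchikey) AS ?hits)
WHERE {{
  VALUES ?inchikey {{ {values} }}
  ?metabolite a wp:Metabolite ;
              dcterms:isPartOf ?pathway ;
              {predicate} ?inchikey .
  ?pathway a wp:Pathway ;
           dc:title ?title .
}}
GROUP BY ?pathway ?title
ORDER BY DESC(?hits)
"""


# Only the aspartate key is given in the text; the other four would be added here.
charged = ["CKLJMWTZIZZHCS-REOHCLBHSA-M"]

exact_query = build_query(charged)
relaxed_query = build_query([neutral_form(k) for k in charged], predicate="rdfs:seeAlso")
print(relaxed_query)
```

The helper only produces the query text; it can then be pasted into the query interface of the Semantic Web portal or submitted with a SPARQL client of choice.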
The next step in this workflow is to load the relevant pathways from WikiPathways into Cytoscape; see Sects. 73.4.1 and 73.4.2 for further details. Once the pathways are available in a network tool, more advanced analysis approaches can be used. The chemical data can be visualized on the nodes in the network, while the edges (interactions between nodes) can be used to visualize fluxomic data (if available). After the data has been added, interpretation can be facilitated by additional features in Cytoscape, such as visualizing the chemical structure of the compounds, connecting several pathways (as networks) to obtain a more complete overview, adding other experimental data (e.g., transcriptomics; see Sect. 73.4.1), or expanding the network with biological knowledge from other databases (see Sect. 73.4.2).