Functional Interaction Network Construction and Analysis for Disease Discovery

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1558)

Abstract

Network-based approaches project seemingly unrelated genes or proteins onto a large-scale network context, thereby providing a holistic visualization and analysis platform for genomic data generated from high-throughput experiments, reducing the dimensionality of the data by using network modules, and increasing statistical power. Based on the Reactome database, the most popular and comprehensive open-source biological pathway knowledgebase, we have developed a highly reliable protein functional interaction network covering around 60% of total human genes and an app called ReactomeFIViz for Cytoscape, the most popular biological network visualization and analysis platform. In this chapter, we describe in detail how this functional interaction network is constructed by integrating multiple external data sources, extracting functional interactions from human-curated pathway databases, building a machine learning classifier called a Naïve Bayesian Classifier, predicting interactions with the trained classifier, and finally constructing the functional interaction database. We also provide an example of how to use ReactomeFIViz to perform network-based data analysis for a list of genes.

Key words

Functional interaction; Biological network; Biological pathway; Reactome; Network-based analysis; ReactomeFIViz; Cytoscape; Naïve Bayesian Classifier; Java; MySQL

1 Introduction

Large-scale data sets are routinely generated in current biological studies using high-throughput techniques in order to understand disease mechanisms and develop better personalized precision therapies for patients. However, these data sets are usually gene or protein based. To understand the relationships among interesting genes or proteins, researchers usually have to project them onto biological pathways and network contexts so that these seemingly scattered genes or proteins can be visualized and studied together as groups.

Proteins and other gene products (e.g., miRNAs, lncRNAs) interact with each other and form a huge interaction network inside the cell. High-throughput experiments, such as Yeast 2-Hybrid (Y2H) [1] and mass spectrometry coupled with co-immunoprecipitation (MS CoIP) [2], have been used to generate large-scale protein–protein interaction data sets for many species, including human, mouse, fly, worm, and yeast. However, interaction data sets generated by these experiments usually have high false positive rates, and physical interactions detected by these methods may not necessarily play functional roles inside cells.

Based on highly reliable, expert-curated pathways in the Reactome knowledgebase [3], the most popular and comprehensive open-source biological pathway database, we have developed a software pipeline to construct a human protein/gene functional interaction (FI) network covering around 60% of total human genes. Figure 1 shows an overview of the procedures and data sources used to construct this FI network. We used protein pairwise relationships from protein physical interactions in human, mouse, fly, worm, and yeast, gene co-expression, Gene Ontology (GO) annotation, and protein domain interactions as features to train a Naïve Bayes Classifier (NBC) based on FIs extracted from Reactome. NBC is a simple machine learning technique that assumes independence among features. The trained NBC was then validated against independent FIs extracted from other non-Reactome pathway databases. The final FI network contains two types of FIs: annotated FIs extracted from curated pathway data and FIs predicted by the NBC. For gene regulatory network analysis, we have also included interactions between transcription factors and their targets from the ENCODE project [4] and the TRED database [5].
Fig. 1

An overview of the procedures used to construct the Reactome functional interaction network
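To make the independence assumption concrete, the following is a minimal Python sketch of a Naïve Bayes classifier over binary pairwise features. It is not the Java implementation in the FINetworkBuild project: the function names, the three toy features (physical interaction, co-expression, shared GO term), the training pairs, and the Laplace smoothing parameter are all assumptions made for illustration.

```python
import math

def train_nbc(feature_vectors, labels, alpha=1.0):
    """Estimate the class prior and P(feature=1 | class) with Laplace smoothing."""
    n_features = len(feature_vectors[0])
    counts = {0: [0] * n_features, 1: [0] * n_features}
    totals = {0: 0, 1: 0}
    for features, label in zip(feature_vectors, labels):
        totals[label] += 1
        for i, value in enumerate(features):
            counts[label][i] += value
    prior = totals[1] / (totals[0] + totals[1])
    cond = {
        c: [(counts[c][i] + alpha) / (totals[c] + 2 * alpha) for i in range(n_features)]
        for c in (0, 1)
    }
    return prior, cond

def score(features, prior, cond):
    """Posterior probability that a protein pair is a functional interaction."""
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for i, value in enumerate(features):
        p1, p0 = cond[1][i], cond[0][i]
        # Independence assumption: per-feature likelihoods simply multiply
        log_pos += math.log(p1 if value else 1.0 - p1)
        log_neg += math.log(p0 if value else 1.0 - p0)
    top = max(log_pos, log_neg)  # subtract the maximum for numerical stability
    pos, neg = math.exp(log_pos - top), math.exp(log_neg - top)
    return pos / (pos + neg)

# Toy training data (hypothetical): (physical interaction, co-expression, shared GO term)
pairs = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0)]
labels = [1, 1, 1, 0, 0, 0]
prior, cond = train_nbc(pairs, labels)
supported = score((1, 1, 1), prior, cond)
unsupported = score((0, 0, 0), prior, cond)
```

A pair whose score exceeds a chosen cutoff would be reported as a predicted FI; a pair with no supporting features receives a low score.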

For users to perform network-based data analysis using the Reactome FI network, we have also developed an app called ReactomeFIViz [6] for Cytoscape [7], the most popular biological network visualization and analysis platform. Users of our app can construct an FI subnetwork for a list of genes, perform network clustering to find network modules, annotate the subnetwork and modules, and then perform survival analysis for network modules.

In this chapter, we describe the procedures to construct the Reactome FI network and briefly introduce ReactomeFIViz. For more information, the reader is encouraged to refer to our other published materials [6, 8].

2 Materials

  1. To construct the Reactome FI network, the reader needs some programming experience in Java and experience using Eclipse, the most popular Java IDE (integrated development environment). Eclipse can be downloaded from http://www.eclipse.org and the source code for the FI construction can be downloaded from http://reactomedev.oicr.on.ca/download/tools/caBIG/FINetworkBuild.zip.

  2. You will also need to install the mysql database server locally after downloading it from http://www.mysql.org.

  3. To draw ROC (receiver operating characteristic) curves for checking the performance of the Naïve Bayes Classifier, you need to install R. You can download the latest version of R from https://www.r-project.org.

  4. It is necessary to download many data sources from the Internet to construct the Reactome FI network. We will give URLs in Section 3 “Methods” when we describe detailed procedures.

  5. To use ReactomeFIViz, the reader should install the latest version of Cytoscape from http://www.cytoscape.org. In order to use Cytoscape, the following hardware and software requirements are suggested:

  6. Hardware Requirements: A 2 GHz or higher dual-core or quad-core CPU, 4 GB or more physical memory, 512 MB video memory, 1 GB or more available hard drive space, a display that supports 1024 × 768 or higher resolution, and a high-speed Internet connection.

  7. Software Requirements: To use the latest version of Cytoscape (version 3.3.0 in January 2016), the reader is required to install Java 8 first. Java 8 is supported under Windows, macOS, and Linux, and can be downloaded from http://www.java.com.

     

3 Methods

3.1 Setting Up Eclipse and Loading the FINetworkBuild Project

  1. The project used to build the FI network is programmed in Java, and Eclipse is used for Java programming. After downloading Eclipse from http://www.eclipse.org and the Java source code for the project from http://reactomedev.oicr.on.ca/download/tools/caBIG/FINetworkBuild.zip, use menu File/Import to get the Project dialog and choose “Existing Projects into Workspace” (Fig. 2a), and then follow the step-by-step procedures to create a Java project in Eclipse (Fig. 2b).
    Fig. 2

    Import the Java source code to build a project for constructing the Reactome FI network. (a) (left): the dialog for importing the Java source code into a project in Eclipse. (b) (right): the package explorer view of the imported Java project for constructing the FI network

     
  2. The configuration file (file name: configuration.prop) in the resources folder is used to configure files and parameters used by the project. Open this file in Eclipse and make sure the following three values are configured as you want them (see Note 1 for suggested values): YEAR, RESULT_DIR, DATA_SET_DIR.

  3. We use log4j for logging output from program runs. The configuration for log4j is in the file log4j.properties in the resources folder. Since some methods generate a large amount of output while running, it is recommended that you write the log output to a file (see Note 2 for some discussion on how to configure log4j).
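Before running the pipeline, it can help to check the three required values in configuration.prop. The Python sketch below is independent of the Java project and assumes that the file uses plain KEY=value lines with “#” comments; the excerpt is a hypothetical example, and the helper names are made up. It also resolves the ${YEAR} placeholder that Note 1 describes for RESULT_DIR.

```python
import re

def load_properties(text):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def resolve(props):
    """Replace ${KEY} references with the value of KEY; unknown keys are left as-is."""
    pattern = re.compile(r"\$\{(\w+)\}")
    return {
        key: pattern.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        for key, value in props.items()
    }

def missing_keys(props, required=("YEAR", "RESULT_DIR", "DATA_SET_DIR")):
    """Return the required property names that are absent or empty."""
    return [key for key in required if not props.get(key)]

# Hypothetical excerpt of configuration.prop
example = """
# values suggested in Note 1
YEAR=2015
RESULT_DIR=results/${YEAR}
DATA_SET_DIR=datasets
"""
config = resolve(load_properties(example))
```

Here config["RESULT_DIR"] resolves to results/2015, and missing_keys(config) returns an empty list when all three values are present.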

     

3.2 Data Sources for ID Mapping

The construction of the Reactome FI network relies on many data sources, which may use different identifiers for genes and proteins. We use identifiers from UniProt [9] as the standard and PIR [10] for ID mapping.
  1. UniProt is the authoritative protein knowledgebase used in FI network construction. We map all proteins and genes to UniProt accession numbers and normalize them based on protein amino acid sequences to remove duplications. For this purpose, you will need two files from the directory current_release/knowledgebase/taxonomic_divisions on its ftp site, ftp.uniprot.org: uniprot_sprot_human.dat.gz and uniprot_trembl_human.dat.gz.

    For normalizing features used by the NBC, you also need another file from UniProt for protein isoform sequences: uniprot_sprot_varsplic.fasta. This file should be downloaded from the UniProt web site, http://www.uniprot.org/downloads, by searching for “Isoform sequences.” Keep the original name of the downloaded file.

    All three files downloaded from UniProt should be placed in the same folder. The folder name should be configured in the configuration file (configuration.prop) for the property UNIPROT_DIR (see Note 3 for a suggested name).

     
  2. We also use a mapping file to map Entrez gene ids to UniProt accession numbers. This file can be downloaded from the PIR ftp site ftp.pir.georgetown.edu/databases/: after logging in, go to idmapping/mapping_by_sp, and download file h_sapiens.tb. Remember where you place your file, and modify these two properties in the configuration file: IPROCLASS_HUMAN_FILE and ENTREZ_TO_UNIPROT_MAP_FILE_NAME. The second property will be used by the program later on.
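The idea behind the Entrez-to-UniProt mapping can be sketched in a few lines of Python. This is an illustration only, not the code in the FINetworkBuild project: it assumes a tab-delimited file in which one column holds the UniProt accession and another holds one or more Entrez gene ids separated by semicolons. The column positions are passed as parameters because the defaults used here are assumptions that should be checked against the downloaded h_sapiens.tb file.

```python
import csv
import io
from collections import defaultdict

def entrez_to_uniprot(handle, uniprot_col=0, entrez_col=2):
    """Build a map from Entrez gene id to the set of UniProt accessions.

    Column indexes are assumptions; verify them against the actual file layout.
    """
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(uniprot_col, entrez_col):
            continue  # skip short or malformed rows
        accession = row[uniprot_col].strip()
        for gene_id in row[entrez_col].split(";"):
            gene_id = gene_id.strip()
            if accession and gene_id:
                mapping[gene_id].add(accession)
    return mapping

# Toy rows for illustration; the third row has no Entrez gene id
sample = io.StringIO(
    "P04637\tP53_HUMAN\t7157\n"
    "P38398\tBRCA1_HUMAN\t672\n"
    "A0A000\tTOY_HUMAN\t\n"
)
entrez_map = entrez_to_uniprot(sample)
```

Storing a set per gene id keeps one-to-many mappings visible, so that any ambiguity can be handled explicitly in a later normalization step.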

     

3.3 Data Sources for Annotated FIs

Annotated FIs are FIs extracted from human curated pathways. In addition to Reactome, we have also extracted FIs from KEGG [11], NCI-PID [12], and Panther Pathways [13]. Pathways in other non-Reactome databases are imported into the Reactome curator tool project first for easy checking and then dumped into an enhanced Reactome database locally. In addition to pathways, we also import two interaction data sets for transcription factors and their targets from TRED [5] and ENCODE [4] as annotated FIs.
  1. Reactome is used as the foundation for the FI network building. We use a slice of the Reactome central database to import data from other sources. You can find this database in the results folder after getting the project source. In addition to this slice database, we also need a snapshot of the original central database. For the latest version, send a help email to help@reactome.org.

    Unzip these two database dump files and import them into your local mysql after logging into it. For example, for loading the slice database from the mysql dump file, test_slice_55.sql, use the following commands:
    • create database reactome_55_plus_i;

    • use reactome_55_plus_i;

    • source test_slice_55.sql

    Refer to Note 4 about loading the dump file into mysql.

    After loading these two databases into your local mysql, modify the following four properties in the configuration file with values you are using for your databases: REACTOME_SOURCE_DB_NAME, DB_USER, DB_PWD, REACTOME_GK_CENTRAL_DB_NAME.

    We will modify the loaded slice database, called reactome_55_plus_i in the aforementioned example, for the FI network construction. For database transaction protection, we need to change the table type in the database from MyISAM to InnoDB. To make this change, run the JUnit method changeMyISAMToInnodb() in class org.reactome.data.ReactomeDatabaseModifier.

    The slice database contains only some of the human proteins in UniProt. Copy other human proteins from the snapshot of the curated database by running the Java method, copyHumanReferenceGeneProducts().

    Finally, we need to modify the database schema by adding a new attribute, dataSource, and two new classes, Interaction and TargetedInteraction, by running the following command in the mysql terminal:
    • source {absolute_path_to}/SchemaModification.sql

    You should find a copy of SchemaModification.sql in the resources folder. Refer to Note5 on how to check if your database schema has been updated.

     
  2. KEGG has many good pathway diagrams and disease pathways. We download the needed files from KEGG’s ftp site: ftp.bioinformatics.jp. However, you will need a license in order to access its ftp site. The following files are needed: kegg/xml/kgml/non-metabolic/organisms/hsa.tar.gz, kegg/pathway/pathway.list, kegg/pathway/map_title.tab, kegg/pathway/organisms/hsa.tar.gz, kegg/genes/links/genes_uniprot.list.gz. Unzip these files and then extract a much smaller mapping file for human proteins using the following command (macOS and Linux only):
    • grep hsa: genes_uniprot.list > hsa_genes_uniprot.list

    Specify values in the configuration file for the properties related to KEGG based on the directory in which you have placed your downloaded files (see Note 3 for more information): KEGG_DIR, KEGG_HSA_KGML_DIR, KEGG_CONVERTED_FILE, KEGG_ID_TO_UNIPROT_MAP_FILE.

     
  3. NCI-PID is a database for cancer pathways, composed of two sources: pathways curated by NCI-PID curators and pathways imported from BioCarta and Reactome. For constructing the FI network, we will import NCI-PID curated pathways and pathways from BioCarta using their BioPAX Level 2 export, which can be downloaded from http://pid.nci.nih.gov/download.shtml. After unzipping, you should find two files: NCI-Nature_Curated.bp2.owl and BioCarta.bp2.owl (see Note 6 about the NCI-PID database).

    For NCI-PID, the following parameters need to be specified in the configuration file based on the directory you have used (refer to Note3): NATURE_PID_DIR, NATURE_PID_CURATED, NATURE_PID_CURATED_CONVERTED, NATURE_PID_BIOCARTA, NATURE_PID_BIOCARTA_CONVERTED.

     
  4. Pathways in the Panther database are imported based on the SBML format. They are downloaded from Panther’s ftp site: ftp.pantherdb.org. Two files are needed in directory pathway/current_release: SBML_{version}.zip and SequenceAssociationPathway{version}.txt ({version} should be replaced by an actual version number, e.g., 3.0.1), and these properties should be set in the configuration file based on the directory you use (refer to Note 3): PANTHER_DIR, PANTHER_FILES_DIR, PANTHER_MAPPING_FILE, PANTHER_CONVERTED_FILE.

     
  5. TRED provides functional interactions between transcription factors (TFs) and their targets. Two types of TF/target interactions are available in TRED: human curated and computationally predicted. For constructing the FI network, we extract only the human curated ones using the Hibernate API (http://hibernate.org/) after installing the TRED database locally in mysql. You can get a mysql dump file from data_archive/tred. You also need to configure the Hibernate configuration file (resources/TREDHibernate.cfg.xml) based on your database and change these two properties (refer to Note 3): TRED_DIR, TRED_CONVERTED_FILE.

     
  6. ENCODE TF/target interactions were generated by the Gerstein group at Yale (http://encodenets.gersteinlab.org) and originally published in Nature [4]. For constructing the FI network, we have kept a file containing these interactions in data_archive/encode/tf-targets.txt. To use this file, specify these properties in the configuration file according to the names of the directory and the file (refer to Note 3): ENCODE_DIR, ENCODE_TFF_FILE, ENCODE_TFF_CONVERTED_FILE. (Refer to Note 7 for more information on using this data set.)
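The grep command given for the KEGG mapping file in step 2 works only on macOS and Linux. On other platforms, the same filtering can be sketched in Python as follows. The function name and the toy rows are made up for illustration, and the row layout shown is an assumption about the KEGG file; the filter itself simply keeps every line that contains “hsa:”, as grep does.

```python
import os
import tempfile

def extract_human_mappings(src_path, dst_path, prefix="hsa:"):
    """Copy lines containing `prefix` from src_path to dst_path; return the number kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if prefix in line:  # same substring match as `grep hsa:`
                dst.write(line)
                kept += 1
    return kept

# Toy demonstration with made-up rows written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "genes_uniprot.list")
    dst = os.path.join(tmp, "hsa_genes_uniprot.list")
    with open(src, "w", encoding="utf-8") as handle:
        handle.write("hsa:7157\tup:P04637\nmmu:22059\tup:P02340\nhsa:672\tup:P38398\n")
    kept = extract_human_mappings(src, dst)
    with open(dst, encoding="utf-8") as handle:
        human_lines = handle.read().splitlines()
```

For real use, call extract_human_mappings with the paths of the unzipped genes_uniprot.list file and the desired output file.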

     

3.4 Data Sources for Predicted FIs

We use multiple features to predict functional interactions between two proteins, including physical interactions downloaded from iRefIndex [14] for human, mouse, fly, worm, and yeast, gene co-expression from two sources [15, 16], GO [17] annotation, and protein domain–domain interactions. Protein interactions from non-human species are mapped to human using Ensembl-Compara [18].
  1. To download files from Ensembl-Compara, go to its download page, http://www.ensembl.org/info/data/ftp/index.html, and choose Comparative/MySQL to go to its ftp site. After logging into the ftp site, choose ensembl_compara_xx (xx for release number), and then download seq_member.txt.gz, family.txt.gz, family_member.txt.gz, and ensembl_compara_xx.sql.gz. Unzip all files.

    Log into your mysql database, create an empty database named ensembl_compara_xx (xx for release number), and load the database schema with the command: source ensembl_compara_xx.sql. Log out from mysql and load the downloaded data files with the following command:
    • mysqlimport -u{mysql_db_user} -p --local ensembl_compara_xx family.txt family_member.txt seq_member.txt

    Refer to Note 8 for how to make sure you have loaded the content correctly.

    Assign correct values to the following properties in the configuration file based on values you are using for your ensembl_compara database: ENSEMBL_DIR, ENSEMBL_COMPARA_DATABASE, ENSEMBL_PROTEIN_FAMILIES (This file will be autogenerated in Subheading 3.5).

     
  2. All protein–protein interactions used are downloaded from iRefIndex [14], which collects interaction data from several sources and then normalizes them based on amino acid sequences and UniProt accession numbers. These interactions are provided in the PSIMI-TAB format and can be downloaded from iRefIndex’s ftp site via http://irefindex.org/wiki/index.php?title=iRefIndex: 6239.mitab.04072015.txt.zip (worm), 7227.mitab.04072015.txt.zip (fly), 9606.mitab.04072015.txt.zip (human), 10090.mitab.04072015.txt.zip (mouse), 559292.mitab.04072015.txt.zip (yeast S288c). Unzip these files, and update the properties with names starting with IREFINDEX in the configuration file.

  3. We use two gene co-expression data sets for NBC training and prediction: Lee’s gene expression [15] and Prieto’s gene expression [16]. You can get these two data files from data_archive/LeeGeneExp and data_archive/PrietoGeneExp (see Note 9 about how we obtained these two data files). Make sure these two properties are correct in the configuration file: LEE_GENE_EXP_FILE_SOURCE and PRIETO_PAIRS_FILE.

  4. We use GO biological process (BP) annotation to check if a pair of genes has shared GO BP terms. You will need two files from GO: gene_association.goa_human from http://www.geneontology.org/GO.downloads.annotations.shtml, and GO.terms_and_ids.txt from http://www.geneontology.org/doc/GO.terms_and_ids. Make sure the file names are as described here and the GO_DIR value is correct in the configuration file.

  5. We use domain–domain interactions from the Pfam database [19]. To download domain files, log into the Pfam ftp site, ftp.ebi.ac.uk, go to the latest release folder (e.g., /pub/databases/Pfam/releases/Pfam29.0 in December 2015), get these two files: pfamA.txt.gz and pfamA_interactions.txt.gz, and then unzip them. Make sure the value for the property PFAM_DIR_NAME is correct in the configuration file.
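The GO-based feature in step 4 asks whether two proteins share a biological process term. The Python sketch below shows one way to compute this from a GO annotation (GAF) file, in which column 2 holds the object identifier, column 4 the qualifier, column 5 the GO term, and column 9 the aspect (P for biological process). The function names and the toy annotation rows are assumptions for illustration, and the actual pipeline may apply additional filtering that is not shown here.

```python
import io
from collections import defaultdict

def load_bp_annotations(handle):
    """Map each object ID (GAF column 2) to its set of GO biological process terms."""
    terms = defaultdict(set)
    for line in handle:
        if line.startswith("!"):
            continue  # GAF header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if "NOT" in cols[3].split("|"):
            continue  # negated annotations do not count as evidence
        if cols[8] == "P":
            terms[cols[1]].add(cols[4])
    return terms

def share_bp_term(terms, protein_a, protein_b):
    """Return True if the two proteins have at least one BP term in common."""
    return bool(terms.get(protein_a, set()) & terms.get(protein_b, set()))

# Toy GAF rows for illustration (references and evidence codes are placeholders)
gaf = io.StringIO(
    "!gaf-version: 2.0\n"
    "UniProtKB\tP04637\tTP53\t\tGO:0006915\tREF:1\tIDA\t\tP\n"
    "UniProtKB\tQ00987\tMDM2\t\tGO:0006915\tREF:2\tIDA\t\tP\n"
    "UniProtKB\tP38398\tBRCA1\t\tGO:0006281\tREF:3\tIDA\t\tP\n"
    "UniProtKB\tP04637\tTP53\t\tGO:0005634\tREF:4\tIDA\t\tC\n"
)
bp_terms = load_bp_annotations(gaf)
tp53_mdm2 = share_bp_term(bp_terms, "P04637", "Q00987")
tp53_brca1 = share_bp_term(bp_terms, "P04637", "P38398")
```

In this toy data the first pair shares a term and the second does not; the cellular component row is ignored because its aspect is C.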

     

3.5 Construction of the Reactome FI Network

We have developed a software pipeline to construct the FI network. Usually you should simply run methods in Eclipse in the following order after all source files and databases have been downloaded, set up and configured correctly as described previously. The following methods are collected in class org.reactome.fi.FINetworkBuilder (refer to Note2 about logging).
  1. prepareMappingFile: After this method runs, four files should be generated: Uni2Pfam.txt, SwissProtACIDMap.txt, ACIDMap.txt in the UniProt directory, and ENTREZ_TO_UNIPROT_MAP_FILE_NAME (see the actual file names in your configuration file) in the iproclass directory.

  2. convertPathwayDBs (refer to Note 10): Convert Pathways in KEGG, NCI-PID, and Panther, and TF/Target interactions in TRED and ENCODE into their own respective curator tool project files with extension names .rtpj. You may need to open these converted project files in the Reactome curator tool to see how they look and make sure they are correct.

  3. dumpPathwayDBs (refer to Note 11): Dump the converted curator tool projects into the extended Reactome database created at step 1 in Subheading 3.3.

  4. dumpPathwayFIs: Extract annotated FIs from individual pathway sources.

  5. prepareNBCFeatures: Check individual features used for training the NBC and generate necessary files for training.

     
  6. trainNBC (refer to Note 12): Before running this method, make sure you have assigned correct values to these two properties in the configuration file: ROC_CURVE_FILE and BP_DOMAIN_SHARED_PAIRS.

    After this method has finished running, draw an ROC (receiver operating characteristic) curve to estimate the performance of the trained NBC by running an R script in the RSource folder, ROCCurveDrawing.R, after modifying the fileName variable in the code. You should get an ROC curve similar to Fig. 3.
    Fig. 3

    An ROC curve drawn based on data points generated from method trainNBC

    To calculate the AUC (area under the curve) of the drawn ROC curve, you need to install the ROC package from this web page: http://www.bioconductor.org/packages/release/bioc/html/ROC.html, and run a command like this in the R console: calculate.AUC(rocData). The result should usually be greater than 85%.

     
  7. predictFIs: Before running this method, set the cutoff value in the configuration file (usually it should be 0.50) using property CUT_OFF_VALUE and the file name for the predicted FIs, PREDICTED_FI_FILE.

  8. buildFIDb: Before running the method, create an empty mysql database named “FI_yyyy” (yyyy should be the year when the FI network is constructed), and make sure the following values in the Hibernate configuration file in the resources folder, funcIntHibernate.cfg.xml, are correct:
    • <property name="connection.url">jdbc:mysql://localhost:3306/FI_YYYY</property>

    • <property name="connection.username">{mysql_user_name}</property>

    • <property name="connection.password">{mysql_user_password}</property>
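As a cross-check on the AUC reported in step 6, the area under an ROC curve can be approximated directly from its data points with the trapezoidal rule. The Python sketch below is an illustration under stated assumptions: the points are made-up (false positive rate, true positive rate) pairs rather than output of trainNBC, and the result is not guaranteed to match what the R ROC package computes.

```python
def auc_trapezoid(points):
    """Approximate the area under an ROC curve by the trapezoidal rule.

    `points` is an iterable of (false_positive_rate, true_positive_rate) pairs.
    The end points (0, 0) and (1, 1) are added if they are missing.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points for illustration only
roc_points = [(0.1, 0.6), (0.3, 0.85), (0.6, 0.95)]
auc = auc_trapezoid(roc_points)
chance_level = auc_trapezoid([])  # only the diagonal end points remain
```

A classifier that performs no better than chance gives an area of 0.5, so values well above that indicate useful discrimination.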

     

3.6 Installing Cytoscape

In previous sections, we described the procedures to construct the Reactome FI network stored in a mysql database (see Note 13 for the total time needed to construct the Reactome FI network). To allow researchers to perform network-based data analysis using this network easily, we have developed a Cytoscape app called ReactomeFIViz [6]. Cytoscape [7] is the most popular open-source network visualization and analysis platform. ReactomeFIViz is an application or “app” that extends the functionality of Cytoscape to explore Reactome pathways and search for disease-related pathways and network patterns using the Reactome FI network. In the following sections, we describe how to use ReactomeFIViz, starting with the installation of Cytoscape.

To install Cytoscape, follow these two steps:
  1. Open the Cytoscape web site (http://www.cytoscape.org/) in your web browser, select the “Download” button, and follow the instructions for user registration and installing the software. Refer to Note 14 for alternative Cytoscape installation methods.

  2. Once installed, launch the Cytoscape application. Refer to Note 15 for more information about how to launch Cytoscape.

     

3.7 Installing ReactomeFIViz

There are two ways to install ReactomeFIViz: directly from within the Cytoscape software (described in the following) or from the Cytoscape App Store (http://apps.cytoscape.org) (see Note 16 for this method).
  1. Launch the Cytoscape software.

  2. Go to the “Apps” drop-down menu and select the “App Manager…” feature.

  3. In the “Search:” text box, type “ReactomeFI” (without the quotation marks).

  4. Select the “ReactomeFIPlugin” App (ReactomeFIViz was called ReactomeFIPlugin previously).

  5. Click on the “Install” button.

     

3.8 Using ReactomeFIViz to Analyze a Gene List

ReactomeFIViz accesses the Reactome FI network allowing the user to construct an FI subnetwork based on a set of genes, query the FI data source for the underlying evidence for the interaction, build and analyze network modules of highly connected sets of genes, perform functional enrichment analysis to annotate the modules with pathway or GO annotations, and overlay a variety of information sources such as NCI Cancer Gene Index or GeneCard annotations.

In this section, we use a simple gene list derived from a glioblastoma multiforme (GBM) study [20] to demonstrate how to analyze a list of genes using the Gene Set/Mutation Analysis feature of ReactomeFIViz. For further details about the different supported file formats, see Note 17. For other features provided in ReactomeFIViz, consult the tutorial at http://wiki.reactome.org/index.php/ReactomeFIViz.
  1. Launch the Cytoscape software.

  2. Go to “Apps” in the drop-down menu, select “Reactome FI,” and then click the “Gene Set/Mutation Analysis” feature. A pop-up window will appear allowing upload of the gene list and configuration of the Gene Set/Mutation Analysis feature.

  3. Choose a Reactome FI Network Version from the three listed versions. For this worked example, the most recent (2014) was selected (see Note 18).

  4. Under File Parameters, choose a file containing genes you want to use to construct a functional interaction network and specify the file format. For this worked example, you can download the GBM data set from: http://reactomews.oicr.on.ca:8080/caBigR3WebApp/Cytoscape/GBM_genelist.txt.

  5. Under FI Network Construction Parameters, select the “Fetch FI annotations” and “Use Linker genes” features, and then click OK (see Note 19). The constructed FI network based upon the uploaded gene list will be displayed in the Cytoscape network view panel (Fig. 4).
    Fig. 4

    The Reactome FI network created from the GBM gene list. An FI specific visual style will be created automatically for the FI network. The main features of ReactomeFIViz can be invoked from a pop-up menu, which can be displayed by right clicking an empty space in the network view panel

     
  6. To query detailed information on selected FIs (see Note 20 for details on the different types of FIs displayed in the network), select the FI edge of interest. For example, select the line connecting the TP53 and MMP13 nodes, and right-click while the cursor hovers over the line to invoke the pop-up menu. Under the “Reactome FI” option, choose the “Query FI Source” feature to display a summary of the source interaction information. In the pop-up window, select the line under “Reactome Sources” to display more detailed interaction data (see Note 21).

  7. To identify “topologically unlikely” clusters (or groups of genes that are closer to each other on the network than you would expect by chance), run the network clustering algorithm [21] on the displayed FI network. To do this, invoke the pop-up menu by right-clicking an empty space in the network view panel, and under the “Reactome FI” option, select the “Cluster FI Network” feature. Nodes in different network modules will be shown in different colors (see Note 22). Selecting a module from the table panel will highlight in yellow the genes in the network view panel that belong to the selected module.

     
  8. To annotate the individual clusters with pathways, invoke the pop-up menu by right clicking an empty space in the network view panel, and under “Reactome FI” option, select the “Analyze Module Functions” feature, and click on “Pathway Enrichment.” To perform GO term enrichment analysis, select one of the GO term categories: GO Molecular Function, GO Biological Process, or GO Cellular Component from the “Analyze Module Functions” feature. Selecting a module size or FDR cutoff value will allow the user to filter enrichment results. After the analysis is complete, a new table panel will appear displaying the results of the pathway enrichment analysis (Fig. 5).
    Fig. 5

    The table panel displaying the results of a pathway enrichment analysis

     
  9. To overlay NCI Cancer Gene Index annotation onto the nodes within the network, invoke the pop-up menu by right-clicking an empty space in the network view panel, and select “Load Cancer Gene Index.” In the Disease hierarchy panel, click through the different levels and terms, selecting and expanding the relevant terms until reaching the lowest point of the hierarchy relevant to the query. Selecting a particular NCI Cancer Gene Index annotation term will highlight in yellow the genes in the network that have that particular annotation (Fig. 6).
    Fig. 6

    The NCI Cancer Gene Index feature overlays yellow highlighting on genes in the network that have breast carcinoma annotations
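The subnetwork construction in step 5 can be pictured with a small Python sketch. This is a simplified illustration and not the algorithm implemented in ReactomeFIViz: here a linker is assumed to be any gene outside the list that interacts with at least two listed genes, and the FIs shown are toy pairs that are not taken from the Reactome FI network.

```python
from collections import defaultdict

def fi_subnetwork(fis, genes, use_linkers=False):
    """Return the FI edges among `genes`, optionally adding simple linker genes.

    Simplifying assumption: a linker is a gene outside the list that interacts
    with two or more listed genes.
    """
    genes = set(genes)
    edges = {tuple(sorted(pair)) for pair in fis
             if pair[0] in genes and pair[1] in genes}
    if use_linkers:
        partners = defaultdict(set)
        for a, b in fis:
            if a in genes and b not in genes:
                partners[b].add(a)
            elif b in genes and a not in genes:
                partners[a].add(b)
        for linker, hits in partners.items():
            if len(hits) >= 2:
                edges.update(tuple(sorted((linker, gene))) for gene in hits)
    return edges

# Toy FI pairs and gene list (hypothetical)
toy_fis = [("TP53", "MDM2"), ("TP53", "EP300"), ("EP300", "RB1"),
           ("EGFR", "GRB2"), ("MDM2", "MDM4")]
gene_list = ["TP53", "MDM2", "RB1", "EGFR"]
direct = fi_subnetwork(toy_fis, gene_list)
with_linkers = fi_subnetwork(toy_fis, gene_list, use_linkers=True)
```

Without linkers only the TP53 and MDM2 pair is connected; with linkers, EP300 joins RB1 to the same component, which illustrates why linker genes are useful for sparse gene lists.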

     

4 Notes

  1. Values in the configuration file. In Reactome, we refresh our FI network once per year. Therefore, the value for “YEAR” should be set to the current year (e.g., 2015 for the 2015 version). RESULT_DIR is usually set as “results/${YEAR}”. The ${YEAR} parameter will be replaced by the value of “YEAR” automatically by the program. The value of DATA_SET_DIR is usually “datasets.”

  2. log4j.properties. There are two rootLoggers configured in log4j.properties. It is recommended that you choose LOGFILE so that the log can be output to a local file. However, you may use the console output (A1) for some methods to view output in the Eclipse console view. It is suggested that you copy all log output from Eclipse to a file for future reference. A file called Combined_Logging.txt is created for this purpose.

     
  3. Suggested names for data source files and output files. You can use existing file names in configuration.prop as examples. Usually you can just change dates in the directory names for newly downloaded files.

  4. Loading mysqldump into mysql. In order to make sure mysql can find the dump file, start mysql from the directory containing the unzipped mysql dump file. Otherwise, you have to use the absolute path to your dump file.

  5. Check updated database schema. To check if your database schema has been updated, download a copy of the Reactome curator tool from http://www.reactome.org/download, connect the tool to your database using menu Database/Database Browser/Schema View, and check if you can find two classes, Interaction and TargetedInteraction under the Event class, and an attribute called dataSource for the topmost class, DatabaseObject. To get the latest version of the database schema for the Reactome API used in multiple places, choose menu File/Export Schema to overwrite the original copy of schema in the resources folder.

     
  6. NCI-PID database. The NCI-PID pathway database was retired around the end of 2015. Its content has been hosted by the NDEx database. We have not tried using the NDEx database, and we believe you may skip pathways from this source and still construct a functional interaction network with enough coverage, since the coverage in Reactome is much higher at present and can compensate for the loss of the NCI-PID pathways.

  7. ENCODE TF/target interactions. There are many TF/Target interactions in the original ENCODE project release. These interactions were detected based on ChIP-seq (chromatin immunoprecipitation sequencing), and many of them may not actually play biological roles inside cells. We use a simple filter to choose TF/Target interactions that are supported by gene co-expression and/or GO BP annotation sharing.

     
  8. Loaded Ensembl-Compara database. To make sure that the procedures used to load the content into the database worked as expected, run the following query after logging into mysql and selecting your database:
    • SELECT COUNT(*) FROM seq_member WHERE taxon_id = 559292 AND source_name LIKE 'UniProt%';

      The returned value should be around 6200.
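If you prefer to script this check, the idea can be sketched as below. The sketch runs the same query against a tiny in-memory SQLite table that stands in for the real MySQL database, so the table content, the helper names, and the 10 % tolerance used to interpret "around 6200" are assumptions made for illustration only.

```python
import sqlite3

# Same query as in the note, with the taxon ID passed as a plain integer.
QUERY = (
    "SELECT COUNT(*) FROM seq_member "
    "WHERE taxon_id = ? AND source_name LIKE 'UniProt%'"
)


def count_uniprot_members(conn, taxon_id):
    """Return the number of UniProt-derived seq_member rows for one taxon."""
    return conn.execute(QUERY, (taxon_id,)).fetchone()[0]


def looks_loaded(count, expected=6200, tolerance=0.1):
    """True if count is within the assumed tolerance of the expected value."""
    return abs(count - expected) <= expected * tolerance


# Toy stand-in for the loaded Ensembl-Compara database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq_member (taxon_id INTEGER, source_name TEXT)")
conn.executemany(
    "INSERT INTO seq_member VALUES (?, ?)",
    [
        (559292, "UniProt/SWISSPROT"),
        (559292, "UniProt/SPTREMBL"),
        (559292, "ENSEMBLPEP"),
        (9606, "UniProt/SWISSPROT"),
    ],
)
toy_count = count_uniprot_members(conn, 559292)
```

Against the real database you would run the same query through your MySQL client or driver and pass the result to looks_loaded.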

     
  9. Gene co-expression data files. The original downloaded data files contain gene pairs only. To use them as features in the NBC, we mapped gene names to UniProt accession numbers using the downloaded UniProt data file. For the Prieto data file, we used the SwissProt part of the UniProt data for the mapping. For Lee's data file, we used the original downloaded mapping file, refseq-hs-annts.txt, chose correlations generated from three or more data sets, and then normalized with the latest UniProt data.
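The mapping step can be sketched as below. This is only an illustration under stated assumptions: the input is taken to be (gene_a, gene_b, number_of_data_sets) tuples, the gene-to-accession dictionary is hypothetical, pairs with an unmapped gene are simply dropped, and the identifiers in the toy data are placeholders rather than real gene names or accession numbers.

```python
def map_coexpression_pairs(pairs, gene_to_uniprot, min_datasets=3):
    """Map co-expressed gene pairs to unordered pairs of UniProt accessions.

    pairs           -- iterable of (gene_a, gene_b, n_datasets) tuples
    gene_to_uniprot -- dict from gene name to accession (hypothetical mapping)
    min_datasets    -- keep correlations seen in at least this many data sets
    """
    mapped = set()
    for gene_a, gene_b, n_datasets in pairs:
        if n_datasets < min_datasets:
            continue  # correlation supported by too few data sets
        acc_a = gene_to_uniprot.get(gene_a)
        acc_b = gene_to_uniprot.get(gene_b)
        # Assumption: skip unmapped genes and pairs collapsing to one protein.
        if acc_a is None or acc_b is None or acc_a == acc_b:
            continue
        mapped.add(tuple(sorted((acc_a, acc_b))))  # store each pair once
    return mapped


# Toy data with placeholder identifiers.
pairs = [("GENE_A", "GENE_B", 4), ("GENE_A", "GENE_C", 2), ("GENE_B", "GENE_D", 5)]
gene_to_uniprot = {"GENE_A": "ACC_0001", "GENE_B": "ACC_0002", "GENE_C": "ACC_0003"}

mapped_pairs = map_coexpression_pairs(pairs, gene_to_uniprot)
```

Sorting each accession pair before storing it means that A/B and B/A count as the same co-expression feature.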

     
  10. Method convertPathwayDBs. To keep the log file, make sure that log4j.prop is configured to write output to a file, and assign enough memory for running this method (e.g., -Xmx4G). Some of the identifiers used by several databases cannot be mapped to UniProt accession numbers, including a very small number (fewer than 200) of KEGG ids from KGML pathways, as well as some of the gene names used by TRED and ENCODE. You may see a FileNotFoundException at the end of the method run. You can ignore it.

     
  11. Method dumpPathwayDBs. It is recommended that you connect your curator tool to the database and check instances converted from other non-Reactome sources by searching based on the dataSource attribute. Some UniProt accession numbers may be duplicated if they cannot be mapped to Reactome ReferenceGeneProduct instances and are used in multiple data sources. This should be fine for generating the FI network after normalization.

     
  12. Method trainNBC. After this method has finished running, you should also run class NBCGUITest to study the prediction results using different combinations of features (see Fig. 7 for a screenshot from running this class).

    Fig. 7 Screenshot of the window generated by running class NBCGUITest

     
  13. Time to construct the Reactome FI Network. All results should be output into the folder defined by property RESULT_DIR in the configuration file. The whole process, including collecting all data sources, usually takes 2 days, with most of the time spent on collecting data files. The time used for Section 3.5 should be around half a day.

     
  14. Cytoscape installation. You can install Cytoscape from a compressed archive distribution, from source, or use one of the automated installation packages that exist for Windows, Mac OS X, and Linux platforms.

     
  15. Launching Cytoscape. The instructions for this depend on your computer's operating system. For Mac or Linux, double-click the Cytoscape icon in the Application/Installation folder. For Windows, open the Cytoscape folder through the Start Menu, and then click the Cytoscape icon. The Cytoscape desktop and the welcome screen should now appear.

     
  16. Alternative ReactomeFIViz app installation. You can also install the ReactomeFIViz app directly from the Cytoscape App Store web site:

    (a) Launch Cytoscape.

    (b) Use a browser to open the ReactomeFIViz page in the Cytoscape App Store: http://apps.cytoscape.org/apps/reactomefiplugin

    (c) Click on the “Install” button. A dialog window should pop up showing the progress.

    (d) When the app is installed, the button on the ReactomeFI plugin web page will change to “Installed.”

       
     
  17. ReactomeFIViz supported file formats. Three different file formats are supported for gene set/mutation analysis: (1) Simple gene set, with one line per gene; (2) Gene/sample number pair, which contains two required columns (gene and number of samples having the gene mutated) and an optional third column listing sample names (delimited by “;”); and (3) NCI MAF, which is the mutation file format used by The Cancer Genome Atlas project (http://cancergenome.nih.gov/). The HotNet Mutation Analysis, which is used for cancer mutation data analysis, uses the NCI MAF format. ReactomeFIViz can also load a pre-normalized gene expression data file. In this case, the expression data file should be a tab-delimited text file with table headers. The first column should contain gene names. All other columns should contain expression values in different samples.
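To make the gene/sample number pair and expression file layouts concrete, here is a small parsing sketch. It is not part of ReactomeFIViz. It assumes that the columns of the gene/sample number pair format are separated by tabs (the note states this only for the expression file), and all function names and toy values are hypothetical.

```python
import csv
import io


def parse_gene_sample_line(line):
    """Parse one line of the gene/sample number pair format.

    Returns (gene, number_of_samples, sample_names). The third column is
    optional and lists sample names separated by ';'.
    """
    fields = line.rstrip("\n").split("\t")  # assumed tab-delimited
    if len(fields) < 2:
        raise ValueError("expected at least two columns: gene and sample number")
    gene = fields[0].strip()
    n_samples = int(fields[1])
    samples = []
    if len(fields) > 2:
        samples = [s.strip() for s in fields[2].split(";") if s.strip()]
    return gene, n_samples, samples


def read_expression_table(handle):
    """Read a tab-delimited expression file with a header row.

    Returns (sample_names, {gene: [expression values per sample]}).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if row:  # skip empty lines
            table[row[0]] = [float(value) for value in row[1:]]
    return header[1:], table


# Toy input with made-up gene and sample names.
with_names = parse_gene_sample_line("GENE_A\t3\tS1;S2;S3\n")
without_names = parse_gene_sample_line("GENE_B\t1")
samples, expression = read_expression_table(
    io.StringIO("gene\tS1\tS2\nGENE_A\t1.5\t2.0\nGENE_B\t0.1\t0.2\n")
)
```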

     
  18. ReactomeFI Network Version. It is possible to get different results using different FI network versions because a later version may contain more proteins/genes and more FIs. From our experience, however, a significant FI network module is usually stable across multiple versions.

     
  19. FI Network Construction Parameters. The “Fetch FI annotations” feature will display the FI attributes on the network. “Use Linker genes” will add “linker” genes to the network. Linker genes are not part of the uploaded gene list but are known to interact with members of the gene list, and adding them increases the connectivity within the network.
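The idea behind linker genes can be sketched as below. This is a simplified illustration and not the selection procedure implemented in ReactomeFIViz: it assumes that FIs are given as plain gene pairs and treats a gene as a linker when it interacts with at least min_partners members of the gene list, a threshold introduced here only for the example.

```python
from collections import defaultdict


def find_linkers(fis, query_genes, min_partners=2):
    """Return genes outside the query list that interact with query genes.

    fis          -- iterable of (gene_a, gene_b) functional interactions
    query_genes  -- the uploaded gene list
    min_partners -- minimum number of distinct query partners (assumption)
    """
    query = set(query_genes)
    partners = defaultdict(set)  # candidate linker -> query genes it touches
    for gene_a, gene_b in fis:
        if gene_a in query and gene_b not in query:
            partners[gene_b].add(gene_a)
        elif gene_b in query and gene_a not in query:
            partners[gene_a].add(gene_b)
    return {gene for gene, hits in partners.items() if len(hits) >= min_partners}


# Toy network with made-up names: Q* are query genes, L1 connects two of them.
fis = [("Q1", "L1"), ("L1", "Q2"), ("Q1", "X1"), ("Q1", "Q2"), ("X2", "X3")]
linkers = find_linkers(fis, ["Q1", "Q2"])
```

With min_partners=2, only genes that bridge two or more query genes are added, which is what increases connectivity among the uploaded genes.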

     
  20. FI Edge Attributes. Edges are displayed based on FI direction attribute values. In Fig. 4, “->” stands for activating/catalyzing, “-|” for inhibition, “-” for FIs extracted from complexes or inputs, and “---” for predicted FIs. See the Cytoscape “VizMapper” tab, “Edge Source Arrow” Shape, and “Edge Target Arrow” Shape values for details.

     
  21. MMP13-TP53 FI Source Information. For the worked example, the MMP13-TP53 interaction is derived from TRED, where TP53 is the transcription factor and MMP13 is the target. It is supported by functional analysis (i.e., Western and Northern blots) and literature citations (i.e., PMID 10753945, PMID 10415795).

     
  22. Clustering Results. In the worked example, there are 7 modules containing 2 genes or more, with the largest module containing 25 genes. Different colors are used only for the first 15 modules based on their sizes.
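The summary given in this note can be reproduced from a gene-to-module assignment along the following lines. The sketch assumes that clustering results are available as a dictionary from gene name to module ID. The function name and toy data are hypothetical, and the tie-breaking rule for modules of equal size is an arbitrary choice made for the example.

```python
from collections import defaultdict


def summarize_modules(gene_to_module, min_size=2, max_colored=15):
    """Group genes by module and pick the modules that receive their own color.

    Returns (kept, colored): modules with at least min_size genes, sorted by
    decreasing size, and the first max_colored of them.
    """
    modules = defaultdict(set)
    for gene, module_id in gene_to_module.items():
        modules[module_id].add(gene)
    kept = sorted(
        (genes for genes in modules.values() if len(genes) >= min_size),
        key=lambda genes: (-len(genes), sorted(genes)),  # size, then names
    )
    return kept, kept[:max_colored]


# Toy assignment with made-up gene names and module IDs.
gene_to_module = {"G1": 0, "G2": 0, "G3": 0, "G4": 1, "G5": 1, "G6": 2}
kept, colored = summarize_modules(gene_to_module, max_colored=1)
module_sizes = [len(genes) for genes in kept]
```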

     

References

  1. Rual J-F, Venkatesan K, Hao T et al (2005) Towards a proteome-scale map of the human protein–protein interaction network. Nature 437:1173–1178
  2. Ewing RM, Chu P, Elisma F et al (2007) Large-scale mapping of human protein–protein interactions by mass spectrometry. Mol Syst Biol 3:89
  3. Fabregat A, Sidiropoulos K, Garapati P et al (2016) The Reactome pathway knowledgebase. Nucleic Acids Res 44:D481–D487
  4. Gerstein MB, Kundaje A, Hariharan M et al (2012) Architecture of the human regulatory network derived from ENCODE data. Nature 489:91–100
  5. Jiang C, Xuan Z, Zhao F et al (2007) TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res 35:D137–D140
  6. Wu G, Dawson E, Duong A et al (2014) ReactomeFIViz: a Cytoscape app for pathway and network-based data analysis. F1000Res 3:146
  7. Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504
  8. Wu G, Feng X, Stein L (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11:R53
  9. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
  10. McGarvey PB, Huang H, Barker WC et al (2000) PIR: a new resource for bioinformatics. Bioinformatics 16:290–291
  11. Kanehisa M, Sato Y, Kawashima M et al (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44:D457–D462
  12. Schaefer CF, Anthony K, Krupa S et al (2009) PID: the pathway interaction database. Nucleic Acids Res 37:D674–D679
  13. Mi H, Poudel S, Muruganujan A et al (2016) PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res 44:D336–D342
  14. Razick S, Magklaras G, Donaldson IM (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinf 9:405
  15. Lee HK, Hsu AK, Sajdak J et al (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res 14:1085–1094
  16. Prieto C, Risueno A, Fontanillo C et al (2008) Human gene coexpression landscape: confident network derived from tissue transcriptomic profiles. PLoS One 3:e3911
  17. Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29
  18. Flicek P, Aken BL, Ballester B et al (2010) Ensembl's 10th year. Nucleic Acids Res 38(Database):D557–D562
  19. Finn RD, Coggill P, Eberhardt RY et al (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285
  20. Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455:1061–1068
  21. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci U S A 103:8577–8582

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  1. Informatics and Biocomputing Program, Ontario Institute for Cancer Research, Toronto, Canada
  2. Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, USA
