Background

Genome-wide profiling studies that compare two or more experimental conditions, such as cells treated with a drug vs. control, diseased vs. normal tissue, gene or protein expression at different time points during cellular differentiation or reprogramming, or candidate gene lists harboring mutations associated with a particular disease, produce lists of genes/proteins without an apparent functional relationship. These lists are commonly analyzed using software tools and databases that map genes to known pathways or construct subnetworks that connect input lists of genes using known protein-protein or other types of molecular interactions [1-10]. Such methods have been instrumental for organizing and reusing prior knowledge to understand new high-content experimental results. Prior knowledge networks, in particular protein-protein interaction networks, have been useful for predicting unknown functions for genes [11, 12], new interactions between proteins [13], and novel disease genes [14], and for guiding experimental research efforts by prioritizing the most likely regulators to test at the bench [15]. The resulting subnetwork diagrams are useful because they display the known relationships between the genes identified experimentally. This approach also abstracts the genes from the query list to higher order biological functions, allowing for the identification of novel relevant genes.

Software tools that provide users the ability to build subnetworks from lists of genes using prior knowledge networks are continually gaining popularity. For instance, a system that we developed a few years ago, Genes2Networks, utilizes twelve protein-protein interaction databases to connect lists of mammalian gene products using a shortest path algorithm [1]. Similarly, the software VisAnt version 3.5 goes a step further to automatically compute enrichment for gene ontology (GO) terms in identified PPI subnetworks [2]. Integrating PPIs, gene regulatory interactions, metabolic networks, and cell signaling networks, ConsensusPathDB provides methods to find connections between human, mouse and yeast genes [3]. Cytoscape, one of the leading academic platforms for building and visualizing networks, through its modular plug-ins, provides ways to construct networks, find paths between nodes, and compute network properties in an integrative manner [16]. Similar functionality is available in PatikaWeb [4], a web application with an underlying large protein interaction database. STRING, arguably the most comprehensive molecular interaction database, contains many different interactions including protein-protein and co-expression with assigned confidence scores [5]. Similar functionality is also available in BioPixie, initially developed for yeast but more recently extended to cover the mouse [17]. Visualization tools such as N-Browse [6], AVIS [18], FNV [19], and Cytoscape Web [20] display subnetworks from heterogeneous types of data sources with different color edges and nodes to represent different types of links and nodes on the web. GeneMANIA [7], another subnetwork generation tool, utilizes Cytoscape Web to display known and predicted protein-protein interactions, co-expression interactions, interactions based on shared pathways, and genetic interactions. 
So far, most subnetwork building software tools utilize only a few types of prior knowledge networks, mostly protein-protein interactions, co-expression, metabolic, and cell signaling pathway networks. Here we extend these efforts by generating 14 functional association networks (FANs) from gene set libraries and combining them with a large-scale network of mammalian protein-protein interactions. The FANs were systematically generated by converting gene set libraries to networks, connecting pairs of genes based on their shared functional annotations. These FANs, together with protein-protein interaction networks, form our background knowledge database for building and visualizing subnetworks from input lists of genes. By keeping the functional relationships separate, we allow users to control which layers of functional associations they wish to integrate into their analysis. This system is delivered as a web-based interactive tool called Genes2FANs. To demonstrate the utility of the Genes2FANs approach we applied the software to connect lists of disease genes for 90 diseases that have many known mutated genes. We find an inverse correlation between the number of protein-protein interaction links and the number of functional annotation links identified when connecting lists of disease genes. This inverse correlation separates complex diseases into two classes: those that are protein interaction centric, including many cancers, and those that are functional centric, including complex spectral disorders such as autism and type-2-diabetes.

Implementation

Methods for constructing the functional association networks

The first step in assembling the FANs was to gather data spread across a wide variety of databases and online sources. Besides collecting a comprehensive list of available protein-protein and cell signaling networks (see below), we also collected and generated gene set libraries that we later converted to FANs. Gene set libraries are stored in gene matrix transposed (GMT) files, in which each row contains a set of gene symbols associated with a given functional term. Using this format we were able to quantify the relationships between pairs of genes based on their co-occurrence in sets of the same gene set library using two different similarity measures: the Jaccard index and a Binomial Proportion test. The process of creating FANs from GMT files is outlined in Figure 1.
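As a minimal sketch of the starting point for both similarity measures, the snippet below reads GMT rows and inverts them into a mapping from each gene to the set of terms it belongs to. It assumes the common tab-separated GMT layout (term, description, then gene symbols); the function name `parse_gmt` and the toy terms and genes are hypothetical and are not taken from the Genes2FANs code base.

```python
from collections import defaultdict


def parse_gmt(lines):
    """Map each gene to the set of terms (GMT rows) in which it appears.

    Assumes tab-separated rows of the form: term, description, gene, gene, ...
    The description column is ignored.
    """
    gene_to_terms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        term, genes = fields[0], fields[2:]
        for gene in genes:
            if gene:  # ignore empty trailing columns
                gene_to_terms[gene].add(term)
    return dict(gene_to_terms)


# Toy library with hypothetical terms and gene symbols.
toy_gmt = [
    "term1\tna\tGENE_A\tGENE_B\tGENE_C",
    "term2\tna\tGENE_A\tGENE_B",
    "term3\tna\tGENE_B\tGENE_D",
]
gene_sets = parse_gmt(toy_gmt)
print(sorted(gene_sets["GENE_A"]))
```

Inverting the file this way means that the similarity between two genes can be computed directly from two sets of term identifiers.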

Figure 1

Process of creating FANs. The process of creating FANs involves gathering datasets and processing them into GMT files. Using these GMT files, networks are created using either the Jaccard index or a Binomial Proportion test. Large and dense networks are filtered using a declustering method and a cutoff is applied to produce the final FANs.

The Jaccard index is a measure of the similarity of two sets, A and B, which is given by the ratio of their intersection to their union:

$$J = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

Scores range from 0 to 1, where an index of 1 indicates identical sets and an index of 0 indicates no overlap between the sets. In our case, to score similarity between a pair of genes, we divided the number of gene sets containing both genes by the number of unique gene sets containing either gene. If we identify the sets A and B with the sets of all lines of the GMT file in which each of the two respective genes is present, then the Jaccard index can be taken as a measure of the degree of association between the genes. The Jaccard index scoring method was applied to gene set libraries (GMT files) that contain a small number of genes per functional term with many different functional terms. Eight FANs were created using this method: miRNAs, mouse phenotypes, metabolites, structural domains, GO biological processes, disease genes (two OMIM-derived FANs), and drug targets. For each network we chose a cutoff that balances coverage (maximizing the number of nodes) against sparseness (minimizing the number of links) (Tables 1 and 2).
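The gene-pair scoring described above can be sketched as follows. This is an illustration only: the all-pairs loop, the `>=` comparison against the cutoff, and the names `jaccard` and `jaccard_edges` are assumptions, and the toy memberships are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_edges(gene_to_terms, cutoff):
    """Score every gene pair and keep pairs at or above the cutoff.

    gene_to_terms maps a gene to the set of GMT rows that contain it.
    Keeping scores >= cutoff is an assumption made for this sketch.
    """
    edges = {}
    for g1, g2 in combinations(sorted(gene_to_terms), 2):
        score = jaccard(gene_to_terms[g1], gene_to_terms[g2])
        if score >= cutoff:
            edges[(g1, g2)] = score
    return edges


# Hypothetical memberships: GENE_A and GENE_B share two of three terms.
gene_to_terms = {
    "GENE_A": {"t1", "t2"},
    "GENE_B": {"t1", "t2", "t3"},
    "GENE_C": {"t3"},
}
edges = jaccard_edges(gene_to_terms, cutoff=0.5)
print(edges)
```

Raising the cutoff removes weak links first, which is how a single threshold can trade node coverage against edge count.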

Table 1 FAN properties
Table 2 Declustering Details

The Jaccard index is biased with respect to our desired measure of similarity when comparing two lists with a large difference in size. For example, if one gene appears in 50 sets, A, and the other in 5 sets, B, but all of these 5 sets are contained within the 50 containing the first gene (B ⊂ A), the Jaccard index is 5/50 = 0.1, a low similarity score even though every set containing the second gene also contains the first. To correct for this we also applied the Binomial Proportion test to measure similarity between gene pairs based on their membership in gene sets. This method was applied to GMT files with a large number of genes per set. We used the z-score from a Normal approximation to the Binomial Proportion test to quantify the similarity between pairs of genes. Z-scores were calculated using the following equation:

$$z = \frac{\frac{a}{c} - \frac{b}{d}}{\sqrt{\frac{b}{d} \cdot \left(1 - \frac{b}{d}\right) / d}} \qquad (2)$$

where a is the number of gene sets that contain both genes, b is the number of gene sets that contain gene1, c is the number of gene sets that contain gene2, and d is the total number of gene sets in the GMT file. A threshold for z-scores was chosen individually for each FAN to balance gene coverage and network sparseness (Table 2). Six functional association networks were created using this method: GeneRIF, CMAP co-expression [21], transcription factor co-regulation using ChEA [22] or TRANSFAC [23], GO molecular function [24], and GeneSigDB [25]. More details about each FAN are described below.
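The following sketch computes the z-score for one gene pair, assuming Equation 2 is read as z = (a/c - b/d) / sqrt((b/d)(1 - b/d)/d) with a, b, c, and d as defined above. The function name `binomial_z` and the handling of a zero denominator are assumptions made for illustration.

```python
import math


def binomial_z(a, b, c, d):
    """z-score for a gene pair under one reading of Equation 2.

    a: gene sets containing both genes
    b: gene sets containing gene1
    c: gene sets containing gene2 (must be at least 1)
    d: total number of gene sets in the library
    """
    p_observed = a / c  # fraction of gene2's sets that also contain gene1
    p_expected = b / d  # background fraction of sets containing gene1
    variance = p_expected * (1.0 - p_expected) / d
    if variance == 0.0:
        # Assumption: treat an undefined score as no association.
        return 0.0
    return (p_observed - p_expected) / math.sqrt(variance)


# The size-biased case from the text: 5 sets fully nested within 50 sets,
# in a hypothetical library of 1000 sets. The Jaccard index is only 0.1,
# whereas the z-score is strongly positive.
print(5 / 50, binomial_z(a=5, b=50, c=5, d=1000))
```

When the observed co-membership equals the background rate (for example a = 1, b = 10, c = 20, d = 200) the score is zero, so only pairs that co-occur more often than expected receive positive scores.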

Declustering algorithm

Initially, many of the networks generated using the Jaccard index or the Binomial Proportion test were very dense, containing many interactions between highly connected genes. This made it difficult to generate specific subnetworks for input gene lists. To reduce the edge clutter of the FANs while preserving the majority of nodes and the most relevant interactions, we computed a score for each gene pair as follows:

w = a + b
(3)

where w is the weight of the edge, a is the connectivity degree of gene1, and b is the connectivity degree of gene2. Edges were sorted by score and the highest scoring edges were iteratively removed, stopping at the point that removed as many edges as possible while losing as few nodes as possible (Table 2).
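A minimal sketch of this declustering step is shown below. It makes two simplifying assumptions that the text does not specify: node degrees are computed once on the original network rather than updated after each removal, and the number of edges to remove (`n_remove`, a hypothetical parameter) is supplied by the caller instead of being found by an automated stopping rule.

```python
from collections import Counter


def decluster(edges, n_remove):
    """Remove the n_remove edges with the highest score w = deg(u) + deg(v).

    Returns the kept edges and the fraction of original nodes that still
    have at least one edge. Degrees are computed once, before any removal
    (an assumption of this sketch).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
    kept = ranked[n_remove:]
    nodes_after = {node for edge in kept for node in edge}
    return kept, len(nodes_after) / len(degree)


# Toy network: hub H is linked to A, B, and C; A-B and C-D are sparser links.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("A", "B"), ("C", "D")]
kept_edges, node_coverage = decluster(toy_edges, n_remove=2)
print(kept_edges, node_coverage)
```

In this toy case removing two hub edges keeps every node connected, whereas removing a third would drop the hub itself, which illustrates the tradeoff between edge loss and node loss that the cutoff has to balance.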

Data extraction and FAN assembly

The Genes2FANs database contains 14 different FANs. Some FANs are made purely from human data whereas others are made from data collected in mouse. All interactions taken from the mouse are converted to their human orthologs using NCBI’s HomoloGene. Data for the miRNA network was taken from the TargetScan database [26]. Mouse phenotype gene sets were obtained from the Mouse Genome Informatics’ Mammalian Phenotype (MGI-MP) Browser [27]. The ontology of the MGI-MP Browser has a tree structure with the most general phenotypes represented by the root nodes and increasingly specific terms at each additional level down the tree. Starting at the lowest, most specific phenotypes, we merged descendants with their ancestor terms up to the fourth level of the tree, producing a condensed set of relations between phenotypes and genes. For the metabolites FAN we derived a GMT file from the Human Metabolome database [28]. Structural domains and their associated Entrez gene symbols were extracted from Pfam [29] and InterPro [30]. The FANs made from GO Biological Process (BP) and GO Molecular Function (MF) terms [24] were assembled using GO Slim. Both OMIM FANs were created from the Online Mendelian Inheritance in Man (OMIM) [31] morbid map. These two GMT files were originally created from OMIM for the Lists2Networks project [32], where the expanded file includes neighboring genes in the PPI. The smallest FAN, drug target, is made using annotated FDA approved drug target relationships extracted from DrugBank [33]. The CMAP co-expression FAN is made from the Connectivity Map (CMAP), which reports drug induced gene expression signatures from human cancer cell lines [21]. We created a GMT file containing the top 1000 genes that either increased or decreased in expression after drug perturbation from all the experiments in the CMAP database. Each gene set therefore has an equal size of 1000 genes per CMAP experiment: 500 up-regulated genes and 500 down-regulated genes.
Data for the GeneRIF FAN was downloaded from NCBI’s gene reference into function dataset which links PubMed IDs to Entrez gene symbols based on manual curation. The transcription factors ChIP-X FAN is made from the ChEA database [22] which is already stored in a GMT-like file, where the functional terms are transcription factors profiled by ChIP-seq/chip experiments and the genes for each term are putative targets for the profiled factor in each experiment. To create a GMT file from TRANSFAC we identified putative target genes for all the human transcription factor binding matrices in TRANSFAC. We scanned the promoter regions of all annotated human coding genes from the −2000 to +500 nucleotides relative to the transcription start site (TSS) using the Patch program provided by TRANSFAC, and then set arbitrary cutoffs to associate transcription factors to their putative targets. GeneSigDB contains thousands of gene lists from supporting material tables manually curated from gene expression studies, mostly cancer related [25]. A summary of all FANs is provided in Table 1 along with node and edge counts, and network creation cutoffs. A more detailed summary of the effects of declustering can be seen in Table 2 with declustering coefficients and node and edge count listings, before and after declustering, for each of the nine declustered FANs. Additionally, the effects of the declustering algorithm on the global network topology can be seen in Additional file 1: Figure S1.

One of the strengths of FANs is the broad coverage of genes and their interactions. Thus, to quantify the overlap between the different types of FANs we assessed their similarity both at the gene and interaction levels, as well as comparing the FANs to the PPI network (Figures 2 and 3). Similarity was measured using the Jaccard index of the total genes and undirected edges in each of the FANs. Unsurprisingly, the largest FANs: ChEA, TRANSFAC, GeneSigDB, CMAP, PPI, and domains, contain many common genes (Figure 2). The diversity of the FANs can also be seen from the network visualization plots. Most of the networks have a large highly connected component while some networks clearly display a modular structure (Figure 4 and Additional file 1: Figure S1).
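The network-to-network comparison can be sketched as below: each network is reduced to its set of genes and its set of undirected edges, and the Jaccard index is computed on each. The function names and the two toy networks are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def network_similarity(edges_a, edges_b):
    """Return (gene-level, edge-level) Jaccard similarity of two networks.

    Edges are treated as undirected, so (u, v) and (v, u) are the same edge.
    """
    undirected_a = {frozenset(edge) for edge in edges_a}
    undirected_b = {frozenset(edge) for edge in edges_b}
    genes_a = set().union(*undirected_a)
    genes_b = set().union(*undirected_b)
    return jaccard(genes_a, genes_b), jaccard(undirected_a, undirected_b)


# Two toy networks that share most genes but only one edge.
net1 = [("A", "B"), ("B", "C")]
net2 = [("B", "A"), ("C", "D")]
gene_sim, edge_sim = network_similarity(net1, net2)
print(gene_sim, edge_sim)
```

As in the toy example, two networks can overlap strongly at the gene level while sharing few interactions, which is why the two levels are reported separately.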

Figure 2

Heatmap of genes. Heatmap showing the similarity of the genes within each of the FANs and PPI network. Similarity was calculated using the Jaccard index.

Figure 3

Heatmap of edges. Heatmap showing the similarity of the interactions connecting genes within each of the FANs and PPI network. Similarity was calculated using the Jaccard index.

Figure 4

Topology of the FANs. The global structure of each of the FANs visualized with Cytoscape.

Developing the mammalian protein-protein interaction network

The protein-protein interaction network used in Genes2FANs contains physical interactions between proteins reported in the literature based on experimental evidence. For Genes2FANs we consolidated 13 databases and several published studies listing experimentally verified physical protein-protein interactions. Protein-protein interactions were combined from the following sources: MINT [34], InnateDB [35], NCBI-HPRD [36], KEGG [37], IntAct [38], BioGRID [39], PPID [40], BIND [41], DIP [42], Ma’ayan et al. [43], Stelzl et al. [44], Rual et al. [45], and Yu et al. [46]. Since high-throughput studies may contain a higher rate of false positives [47], we filtered the BioGRID [39] and IntAct [38] databases to include only those interactions from studies that reported 10 or fewer protein-protein interactions. This removes publications that report protein interactions from mass-spectrometry proteomics and yeast-2-hybrid screens. Hence, the Genes2FANs software contains two versions of the PPI dataset: filtered and unfiltered.
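The publication-based filter can be sketched as follows, assuming each interaction record carries the identifier of the study that reported it. The record layout `(protein_a, protein_b, study_id)`, the function name, and the toy identifiers are hypothetical; the threshold of 10 comes from the text.

```python
from collections import defaultdict


def filter_low_throughput(interactions, max_per_study=10):
    """Keep interactions reported by studies with at most max_per_study
    distinct interactions.

    interactions: iterable of (protein_a, protein_b, study_id) tuples.
    """
    interactions = list(interactions)
    per_study = defaultdict(set)
    for a, b, study in interactions:
        per_study[study].add(frozenset((a, b)))  # undirected, de-duplicated
    return [
        (a, b, study)
        for a, b, study in interactions
        if len(per_study[study]) <= max_per_study
    ]


# Toy data: one small study (2 interactions) and one screen (11 interactions).
small_study = [("A", "B", "study_1"), ("C", "D", "study_1")]
large_screen = [(f"P{i}", f"Q{i}", "study_2") for i in range(11)]
filtered = filter_low_throughput(small_study + large_screen)
print(len(filtered))
```

Counting distinct pairs per study, rather than raw rows, avoids penalizing a small study whose interactions are listed more than once in a database export.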

Web interface

The Genes2FANs web interface was developed using PHP, JavaScript, AJAX, and Perl. The core code for building subnetworks is implemented in C with a custom built hash function for fast access of network nodes and links. FNV, the subnetwork viewer, was implemented using Adobe ActionScript 3.0 [19]. Currently, the application resides on a Linux server running Apache. To begin an analysis, users can enter a gene list by adding Entrez gene symbols one at a time or by pasting a list for upload. Users are presented with a choice of FANs to include and several options to control the size and aesthetics of the resulting subnetworks. Results are presented as an interactive subnetwork diagram and a table of intermediate genes, ordered by z-scores that indicate how significant each intermediate is for the input gene list; the z-scores are computed using a Binomial Proportion test. The interactive subnetwork allows users to reposition nodes, hover over edges to reveal the gene sets that contributed to the edge, and pan and zoom. There are also various export options allowing users to save the network for offline analysis. Figure 6 shows a screenshot of the web interface.

Figure 5

Converting PubMed queries to lists of Entrez gene symbols. PubMed queries are first converted into a list of PubMed IDs using NCBI’s e-utilities. For each PubMed ID a list of genes is obtained using GeneRIF. Genes are tallied and sorted by their occurrence and the top N genes are uploaded automatically into Genes2FANs.

PubMed search feature

If users do not have a specific gene list to enter they can query PubMed with any search term to generate a list of genes. Genes2FANs provides users with the option to choose the number of genes to return from a PubMed search, because shorter lists are more appropriate for specific queries whereas longer lists are better for ambiguous search terms. To facilitate this function we use NCBI’s e-utilities to turn search terms into their corresponding PubMed IDs and then use the GeneRIF file to convert the PubMed IDs into human genes with occurrence counts. Genes are ranked by their number of occurrences in all returned PubMed IDs. This process is summarized in Figure 5.
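The tallying step can be sketched as below. The PubMed search itself is not performed here: the list of PubMed IDs is assumed to have been retrieved already, and the GeneRIF data is represented by a small in-memory dictionary. All identifiers, gene names, and the alphabetical tie-breaking rule are hypothetical choices made for this illustration.

```python
from collections import Counter


def top_genes(pmids, pmid_to_genes, n):
    """Rank genes by how many of the returned PubMed IDs mention them.

    pmids: PubMed IDs returned by a search (retrieval is not shown here).
    pmid_to_genes: mapping from PubMed ID to the genes linked to it,
        standing in for the GeneRIF file.
    n: number of genes to return.
    """
    counts = Counter()
    for pmid in pmids:
        counts.update(set(pmid_to_genes.get(pmid, ())))
    # Sort by descending count; break ties alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical stand-in for GeneRIF and for a search result.
pmid_to_genes = {
    "pmid_1": {"GENE_A", "GENE_B"},
    "pmid_2": {"GENE_A"},
    "pmid_3": {"GENE_C", "GENE_A"},
}
search_hits = ["pmid_1", "pmid_2", "pmid_3", "pmid_4"]
print(top_genes(search_hits, pmid_to_genes, n=2))
```

PubMed IDs without any linked gene, such as `pmid_4` in the toy search result, simply contribute nothing to the tally.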

Figure 6

The Genes2FANs web interface. A screenshot showing the results of running Genes2FANs with the query “eye color”. On the left side of the page users can enter a PubMed query or a gene list and customize the output settings. The resulting subnetwork and a table listing ranked intermediates are shown on the right. Users can also obtain all the functional and binding interactions for a specific gene.

Results and discussions

Analysis of disease gene FAN

To demonstrate the capabilities of Genes2FANs we applied it to find relationships between disease genes. Disease gene discovery using network approaches based on pathway reconstruction has recently proven useful. Typically, applications first construct a large background network and then use disease genes as seed nodes for building subnetworks that connect the seed nodes [1, 48-52]. Here we implemented a similar approach to obtain a global view of subnetworks created from many disease gene lists. From the OMIM morbid map dataset [31] we compiled a GMT file containing all diseases with at least 10 genes, which yielded 90 common genetic disorders (n = 90). We then used Genes2FANs to connect the genes for each disease without any intermediates, using either the PPI networks only or the FANs only, excluding the OMIM FAN. We also used the disease terms from the same GMT file as input for the PubMed query tool of Genes2FANs, setting the number of returned genes to 100. The size of the networks built using the PPI networks only or the FANs only (without the OMIM FAN) was then recorded. To examine the correlation between the PPI and FAN links across all diseases, we plotted the number of PPI edges against the log of the ratio of PPI edges to functional edges. We then generated a local fit by partitioning the points into groups of 10 for the OMIM gene lists and 15 for the subnetworks made using the PubMed query function, and calculating the mean of each group. The variation was illustrated in the plot by shading the region within one standard deviation of the mean of each bin.
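The binned local fit can be sketched as follows. Several details are assumptions, since the text does not state them: the natural logarithm is used, points are ordered by their PPI edge count before being grouped, the population standard deviation is reported, and diseases with zero edges of either type are skipped. The edge counts are toy values, not results from the paper.

```python
import math
from statistics import mean, pstdev


def binned_trend(ppi_edges, fan_edges, bin_size):
    """Group diseases into bins of bin_size and summarize each bin.

    Returns one tuple per bin: (mean PPI edges, mean log ratio, std of the
    log ratio), where the log ratio is log(PPI edges / functional edges).
    """
    points = sorted(
        (p, math.log(p / f))
        for p, f in zip(ppi_edges, fan_edges)
        if p > 0 and f > 0  # the log ratio is undefined otherwise
    )
    trend = []
    for start in range(0, len(points), bin_size):
        chunk = points[start:start + bin_size]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        trend.append((mean(xs), mean(ys), pstdev(ys)))
    return trend


# Toy edge counts for four hypothetical diseases.
toy_ppi = [1, 2, 4, 8]
toy_fan = [8, 4, 2, 1]
trend = binned_trend(toy_ppi, toy_fan, bin_size=2)
print(trend)
```

Each tuple gives the center of a bin and the band of one standard deviation around its mean, which corresponds to the line and the shaded region described for the plot.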

With both methods, directly from OMIM or through PubMed queries, diseases show an inverse correlation between protein-protein interaction (PPI) links and other types of functional annotation links, segregating diseases with many known genes into two broad categories: those with gene products that physically interact, and those that interact functionally but not physically (Figures 7 and 8). This trend is statistically significant, with a Spearman rank correlation of 0.73 (p = 2.97 × 10^-10) for the PubMed queried lists and 0.27 (p = 0.0065) for the lists taken directly from OMIM. The diseases that show a high level of PPI and a low level of functional associations include breast, ovarian, pancreatic, colorectal, thyroid, gastric, lung, and prostate cancers, as well as ataxia and leukemia (Figure 9), whereas diseases that display a high level of functional interactions and a low level of PPI are deafness, type-2 diabetes mellitus, asthma, schizophrenia, autism, and epilepsy. To ensure that this is not an artifact of the declustering algorithm applied to the FANs, we ran the same process using the nine FANs before declustering. The declustering process had little effect on these results (Additional file 2: Figure S2 and Additional file 3: Figure S3), with a Spearman rank correlation of 0.38 (p = 0.00026) for the PubMed queried lists and 0.57 (p = 1.99 × 10^-7) for the lists taken directly from OMIM. The finding that some diseases have disease genes that are linked mostly through PPI, while other disease genes are mostly connected through FANs, is important because many investigations attempt to use protein interactions for novel disease gene discovery, for example, prioritizing mutations in genes detected by exome sequencing.
This suggests that disease gene discovery using a PPI approach would work well for diseases such as cancers where many PPIs connect the disease gene products; however, for other complex diseases such as autism and type-2 diabetes, FANs would potentially be better for disease gene discovery.

Figure 7

Distribution of edges for the disease gene lists. The distribution of edges for disease subnetworks created using genes directly from OMIM (A) and the disease terms with a maximum of 100 returned genes from the PubMed query tool of Genes2FANs (B). Diseases with a sum of PPI and functional edges less than 10 were omitted from both distribution plots.

Figure 8

Correlation between subnetwork size and the edge ratio of PPIs to FANs. Scatterplots showing the correlation between the number of edges in the PPI subnetworks for each disease and the log of the ratio of PPI edges to functional edges. The red line depicts the mean of the data points (calculated by partitioning the points into groups of 10 for the OMIM disease gene lists (A) and 15 for the subnetworks made using the query PubMed function (B)). The blue dotted lines show one standard deviation away from the mean.

Figure 9

Top diseases. The top 10 diseases with the greatest difference in edge counts for the PPI vs. FANs disease subnetworks made from the OMIM disease gene lists (A) and the top 20 diseases for the subnetworks made using the query PubMed function (B).

Comparison to other similar tools

Finally, we compared Genes2FANs to other similar online software tools that are currently available. To compute the average number of interactions returned by each of the tools we used a list of 20 randomly selected human genes and calculated the average and standard deviation of unique interactions reported by each tool. We used the nearest neighbor function of Genes2FANs and summed the number of interactions returned from each of the functional networks and the PPI. For PIPs [8], we ran the tool using the default settings and counted every interaction that had a score higher than 0. We ran HEFalMp [9] to explore a gene in relation to all genes in the context of all biological processes, only counting potentially interacting genes that had a confidence score higher than 0.5. To count the number of interactions returned by GeneMANIA [7] we searched for human genes with default settings and counted each edge as a separate interaction. Similarly, for STRING 9.0 [5] we ran the gene query as a human gene with default settings and counted unique edges. We also tested FunCoup [10] with its default settings. By default FunCoup applies an algorithm to reduce the number of probable links for a gene query. As a result many of our queries were capped at 60 returned genes even when more significant interactions were identified.

It is difficult to quantify the accuracy of our approach compared with other similar tools because there is no gold standard for functional relationships between genes. As a result, we cannot fairly compare the sensitivity and specificity of our tool against existing similar tools that integrate functional relationships. To show the differences between the tools we have chosen to focus on the number of interactions that are returned for an input gene (Table 3). The totals show a clear pattern: each tool is suited for a different purpose. In terms of accuracy, a tool such as GeneMANIA, PIPs, or FunCoup might provide a user with more reliable novel PPIs for their query. On the other hand, for a more comprehensive analysis, STRING or HEFalMp would be the best performer. It is also worth noting that there is a great deal of overlap between Genes2FANs and these existing systems. As an example, using BRCA1 as input for each of the tools with default settings, and as input for the nearest neighbor function in Genes2FANs, we observed that most of the genes returned by STRING 9.0 (10 out of 10 genes), GeneMANIA (12 out of 19), and FunCoup (12 out of 25) were in our PPI dataset. All but four of our functional interactions for BRCA1 were returned by HEFalMp with varying degrees of confidence, and three of these functional interactions were also returned by PIPs. The genes identified by Genes2FANs but not by HEFalMp are OVCAS1, FAM82A1, AIMP2, and MIR21. These genes have been implicated in the literature as associated with breast and/or ovarian cancer and may be indirectly related to BRCA1.

Table 3 Comparison with Similar Tools

Conclusions

Genes2FANs is a potentially useful tool for interpreting the relationships between gene lists in the context of their various functions and networks. Combining these functional association interactions with physical protein-protein interactions from high and low throughput datasets can be useful for revealing new biology and help form hypotheses for further experimentation. Our observation of disease gene lists commonly connected by either PPIs or FANs, but not by both, can assist with disease gene discovery strategies using network analysis and disease gene classifiers.

However, Genes2FANs is not without limitations. Currently, it does not include a confidence score for each edge. We also keep the FANs separate, although all FANs could potentially be integrated into one large network. In the future we plan to continue updating Genes2FANs with more FANs and to add more interactive features to the website. We also plan to develop a feature that will allow users to upload their own gene set libraries for constructing their own functional networks. Additionally, we are working on refining our network generation process to improve the quality of the FANs.

Availability and requirements

Project name: Genes2FANs

Project home page: http://actin.pharm.mssm.edu/genes2FANs

Operating System: Platform Independent

Programming Language: HTML, CSS, JavaScript, Perl, C, PHP, Python, Flash/Action Script

Other Requirements: Adobe Flash Player 9.0 or higher

License: GNU GPL