Predicting Node Characteristics from Molecular Networks

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 781)

Abstract

A large number of genome-scale networks, including protein–protein and genetic interaction networks, are now available for several organisms. In parallel, many studies have focused on analyzing, characterizing, and modeling these networks. Beyond investigating the topological characteristics such as degree distribution, clustering coefficient, and average shortest-path distance, another area of particular interest is the prediction of nodes (genes) with a given characteristic (labels) – for example prediction of genes that cause a particular phenotype or have a given function. In this chapter, we describe methods and algorithms for predicting node labels from network-based datasets with an emphasis on label propagation algorithms (LPAs) and their relation to local neighborhood methods.

Key words

Functional linkage networks, Gene function prediction, Label propagation

1 Introduction

Networks are commonly used to represent relationships between genes or proteins. For example, in a protein–protein interaction network, nodes represent proteins and the edges represent physical interactions between the connected nodes. Other widely available networks include genetic interaction networks and coexpression networks (see Table 1 for a summary of common network types). In parallel with the increase in number and size of interaction networks that are now available for various organisms, much work has been done to analyze, model, and make predictions from such network-based data.
Table 1

This table summarizes some examples of widely used interaction networks whose edges are predictive of cofunctionality between the connected genes or proteins

| Network | Experimental methodology | Description |
| --- | --- | --- |
| Cocomplex networks | Copurification (affinity capture) | An interaction is inferred between members of subunits in a purified protein complex, using affinity purification and one or more additional fractionation steps |
| Protein interaction networks | Protein-fragment complementation assay (PCA) | Two proteins of interest are fused to complementary fragments of a reporter protein; when the proteins of interest interact, the reporter protein fluoresces |
| Protein interaction networks | Two-hybrid | A bait protein is expressed as a DNA-binding domain (DBD) fusion and a prey protein as a transcriptional activation domain (TAD) fusion; the interaction is measured by reporter gene activation |
| Genetic interaction networks | Synthetic genetic array (SGA); diploid-based synthetic lethality analysis on microarrays (dSLAM) | A genetic interaction between two genes is inferred when the mutation (or deletion) of both genes results in a phenotype that is unexpected given the single mutations of each gene; for example, in yeast, a synthetic lethal interaction is inferred whenever a double mutation of two nonessential genes results in lethality |
| Colocalization networks | Green fluorescent protein (GFP) fusion | An interaction is inferred from the colocalization of two proteins in the cell, including codependent association of proteins with promoter DNA in chromatin immunoprecipitation experiments |
| Coexpression networks | Microarray | Quantification of changes in the expression levels of genes by measuring the abundance of their corresponding mRNA in different conditions, based on hybridization of labeled mRNA to known probes |
| Coexpression networks | Serial analysis of gene expression (SAGE) | Quantification of transcript abundance by cloning and sequencing of the extracted mRNA |
| Transcriptional regulatory networks | ChIP-on-chip | Combines chromatin immunoprecipitation (ChIP) with microarrays (chips) to determine the binding sites of DNA-binding proteins on a genome-wide basis |
| Coinheritance networks (shared phylogenetic profiles) | | Coinheritance networks are derived from phylogenetic profiles that summarize the presence/absence of homologous proteins in various species |

An area of particular interest is the prediction of node characteristics (or labels) in a network. For example, in gene function prediction, given a network and labeled genes that are involved in a function of interest (these input genes are referred to as “query genes” or “positives”), the goal is to predict other genes that are deemed to be involved in the same function (e.g., (1)). This is a well-motivated task: even in model organisms such as yeast, a large number of genes have not yet been annotated with a precise function (see Fig. 1). Similarly, in disease gene prioritization, given a network and a group of genes that are involved in a given disease, the goal is to prioritize other genes based on how likely they are to be involved in the same (or a similar) disease (e.g., (2, 3)). Often, the list of genes to be prioritized is a subset of the genes that are present in the network; for example, disease-associated chromosomal loci from genome-wide association studies (GWAS). Algorithms for solving these problems use the observation that genes (or proteins) that are coexpressed, or that have similar physical or genetic interactions, tend to have similar functions (see ref. 4 for a review) or result in similar phenotypes (see ref. 5 for a review); this principle is referred to as “guilt-by-association.”
Fig. 1.

This figure depicts our current knowledge of protein function in yeast. The annotation counts are based on informative GO Biological Process associations (downloaded April 2010). We define informative GO categories as those having fewer than 500 annotations.

In the case of function prediction, positive gene labels can be obtained from databases such as Gene Ontology (GO) (6), KEGG (7), and MIPS (8). These databases provide both a controlled vocabulary for describing categories of gene function and curated lists of genes annotated to these functions. For disease prioritization, positive gene lists can be obtained from the Human Phenotype Ontology (9) and the OMIM database (10), which provide genes associated with various phenotypic abnormalities and genetic diseases.

Here, we focus on algorithms that solve a binary classification problem. If necessary, it is easy to generalize these approaches to predict multiple functions/diseases per gene. Formally, given a network over all entities (genes or proteins), and a set of binary labels where positives are labeled as +1, the goal is to predict which unlabeled nodes are likely to be positives. The “guilt-by-association” approach only considers direct neighbors of nodes when making predictions (1). However, indirect interactions and global network topology can often improve the prediction performance (11, 12, 13). As such, a large number of models and algorithms have been proposed that consider both direct and indirect interactions (12, 13, 14, 15, 16, 17, 18). For example, label propagation algorithms (LPAs) assign continuous scores (predicted labels) to all nodes in the network while considering both local and global network topology. In addition to offering a principled way of incorporating indirect interactions, the complexity of LPAs scales with the number of edges in the network, and thus LPAs are computationally feasible for very large networks: real-world networks are typically sparse, with often less than 0.1% of all possible edges present.

In this chapter, we focus on describing commonly used algorithms for predicting gene labels (gene function or involvement in a disease) from network-based data. In particular, as done in (19), we categorize such algorithms into two broad categories: those that use the node’s local neighborhood and those that use global network topology when predicting node labels. For the latter category, we mainly focus on LPAs. As we show, LPAs are closely related to local neighborhood approaches; we describe an approximation framework for LPAs from which the local neighborhood methods can be directly derived.

The rest of this chapter is organized as follows: in Subheading 2.1, we describe how to construct networks from various high-throughput data sources, with the purpose of using these networks for prioritizing genes or predicting gene function; in Subheading 2.2, we review algorithms for predicting node labels from a network; in Subheading 2.3, we describe several methods for constructing networks from multiple high-throughput data sources; and in Subheading 2.4, we describe several online resources for gene prioritization and for predicting gene and protein function.

2 Methods

The task of gene function prediction requires three components: a network, which can be constructed from one or many different high-throughput data sources; a set of positive genes (see Note 1); and an algorithm for making predictions from the network. In the next section, we describe how to construct individual networks that represent the evidence for cofunctionality implied by individual high-throughput datasets.

2.1 Constructing Networks for Predicting Gene Function

To predict gene function, we assume that we are provided with a network whose nodes correspond to genes and whose edges represent the strength of the evidence for cofunctionality between the connected genes. These networks are called functional linkage networks (FLNs). Several different types of FLNs support good function prediction performance, including those whose edge weights represent coexpression, genetic interaction, protein interaction, coinheritance, colocalization, or shared domain composition of the connected genes. A number of studies have demonstrated a drastic improvement in the accuracy of function prediction when multiple data sources are combined (13, 17, 18, 20, 21, 22, 23, 24). Below, we first review how to construct individual networks from high-throughput data sources; in Subheading 2.3, we describe how to combine multiple networks into one composite network for input to a label prediction algorithm.

We broadly classify networks into those that are derived from interaction-based data and those that are derived from profile-based data. The former includes networks derived from protein and genetic interaction datasets, and the latter includes those derived from gene expression profiles or patterns of protein localization.

For profile-based datasets, such as gene expression, the edges in the corresponding network are constructed from pairwise similarity scores. Determining the appropriate similarity metric for a given data type is an active area of research; for example, much work has been carried out on constructing coexpression networks (25, 26). Here, we present a simple method that performs well on a variety of data sources (as used in refs. 17, 27). In particular, for many types of profile-based data, we have found that the Pearson correlation coefficient (PCC) results in networks with comparable or better performance than other similarity metrics. For binary profile-based data such as protein localization, prior to taking the PCC, we use the following background correction, which significantly improves the resulting FLNs: given a binary matrix B with n rows (genes) and d columns (features), we set all 1’s in column i to \( -\mathrm{log}({p}_{i}^{(1)})\) and all 0’s to \( \mathrm{log}(1-{p}_{i}^{(1)})\), where \( {p}_{i}^{(1)}\) is the probability of observing a 1 in column i. In this way, genes that share “rare” features will have higher correlation than those that share “common” features (see Note 2).
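To make this concrete, below is a minimal sketch (not the authors' implementation) of the background correction followed by PCC-based network construction, assuming numpy; zeroing out negative correlations is our illustrative choice, not part of the recipe above.

```python
import numpy as np

def background_correct(B):
    """Reweight a binary gene-by-feature matrix B so that rare features
    contribute more to the correlation: 1 -> -log(p_i), 0 -> log(1 - p_i),
    where p_i is the frequency of 1's in column i."""
    B = np.asarray(B, dtype=float)
    p = np.clip(B.mean(axis=0), 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.where(B == 1, -np.log(p), np.log(1 - p))

def pcc_network(X):
    """Build an FLN whose (i, j) edge is the Pearson correlation between
    the profiles (rows) of genes i and j."""
    W = np.corrcoef(X)
    np.fill_diagonal(W, 0.0)   # no self-edges
    W[W < 0] = 0.0             # illustrative choice: keep positive evidence only
    return W

# Toy data: 5 genes, 4 binary localization features.
B = np.array([[1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])
W = pcc_network(background_correct(B))
```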

For interaction-based data, we can always use the binary interactions alone as the final network. However, several studies have observed a drastic improvement in performance when constructing FLNs using the PCC between the interaction profiles of genes or proteins (28, 29, 30). Calculating the PCC on frequency-corrected data, as described above, further improves the performance (28).

2.2 Predicting Node Labels from Networks

In this section, we review and discuss algorithms for predicting binary labels using networks. In particular, we focus on LPAs and their relationship to simpler direct and indirect neighbor methods.

Notation

Here, we assume we are given a network, represented as a symmetric matrix W. Assuming there are n nodes (genes) in the network, W is an n × n matrix with entries \( {w}_{ij}={w}_{ji}\ge 0\) representing the weighted edge between nodes i and j. For example, in the case of protein–protein interaction, \( {w}_{ij}\) can be binary (indicating the absence or presence of a physical interaction) or weighted according to the score (e.g., −log of the p-value) of the interaction between proteins i and j. We represent the labels using a vector \( \overrightarrow{y}\in {\{0,1\}}^{n}\), where positive nodes (e.g., those involved in a given function of interest) are labeled as +1 and unlabeled nodes are labeled as 0. Ideally, in addition to the positives, some nodes would serve as negative examples and thus be assigned a negative label (\( {y}_{i}=-1\)); however, such negative examples are rarely available, and here we assume we only have positive and unlabeled nodes.

Local Neighborhood Approaches

In guilt-by-association, genes are assigned a score based on their direct connections to the positive genes. For example, in (1), a gene is predicted to have the same function as those of the majority of its direct neighbors. More recent studies have extended guilt-by-association to include second-degree (indirect) interactions (13) or consider a small neighborhood around the positively labeled genes (31) when assigning scores to the unlabeled genes. Below, we present a single general framework that serves as the basis for deriving existing local neighborhood methods.

As a first attempt, we can calculate the score \( {f}_{i}\) of an unlabeled gene i as the weighted sum of the labels of its direct neighbors: \( {f}_{i}={\displaystyle \sum _{j=1}^{n}{w}_{ij}{y}_{j}}\) (e.g., (32)). Note that the summation is over all n genes; since the labels are binary, if W is also binary, this expression counts the number of neighbors of i that are labeled +1. For a weighted network, this expression weights the labels of i’s positive neighbors according to their connection strength. However, it is standard practice to normalize the matrix W using the weighted node degrees, as this results in better performance:
$$ {f}_{i}=\frac{1}{{d}_{i}}{\displaystyle \sum _{j=1}^{n}{w}_{ij}{y}_{j}},$$
where \( {d}_{i}={\displaystyle \sum _{j=1}^{n}{w}_{ij}}\) is the weighted degree of node i. After normalizing W, the score vector can be computed in matrix form as \( \overrightarrow{f}={D}^{-1}W\overrightarrow{y}=P\overrightarrow{y}\), where D is a diagonal matrix with diagonal elements \( {d}_{i}\) [i.e., D = diag(d)]. The matrix \( P={D}^{-1}W\) is known as the Markov transition matrix (or a singly stochastic matrix). Since all row sums of P equal 1, each row can be interpreted as a probability distribution over single steps of a random walk starting from the corresponding node: \( {p}_{ij}\) represents the probability of a one-step walk from node i to node j, and \( {\displaystyle \sum _{j=1}^{n}{p}_{ij}=1}\).
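For illustration, a minimal numpy sketch of this degree-normalized direct-neighbor score (the toy network and the zero-degree guard are our additions):

```python
import numpy as np

def direct_neighbor_scores(W, y):
    """Guilt-by-association with degree normalization: f = D^{-1} W y = P y."""
    d = W.sum(axis=1)
    d[d == 0] = 1.0            # guard: isolated nodes keep a score of 0
    P = W / d[:, None]         # Markov transition matrix; rows sum to 1
    return P @ y

# Toy network of 4 genes; gene 0 is the only positive.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1, 0, 0, 0], dtype=float)
print(direct_neighbor_scores(W, y))  # [0.  0.5  0.333...  0.]
```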
Using the above formulation allows us to extend guilt-by-association to indirect neighbors. In particular, one can easily calculate the probability of a random walk of length 2 between all pairs of nodes in the network by computing \( {P}^{2}\). The (i,j)-th entry of \( {P}^{2}\) is given by \( {[{P}^{2}]}_{ij}={\displaystyle \sum _{k=1}^{n}{p}_{ik}{p}_{kj}}\) and represents the probability of a random walk of length 2 from node i to node j. In this way, we can include \( {P}^{2}\) when calculating the node scores:
$$ {f}_{i}={\displaystyle \sum _{j=1}^{n}{p}_{ij}{y}_{j}+{\displaystyle \sum _{j=1}^{n}{[{P}^{2}]}_{ij}{y}_{j}}},$$
(1)
where the second term accumulates, for each node, the probability of two-step walks to the positively labeled nodes (its indirect neighbors, two steps away). Similarly, this approach can be extended to include nodes at a distance r (usually r < 4) by noting that \( {[{P}^{r}]}_{ij}\) represents the probability of a random walk from i to j in r steps. We note that previous approaches have shown that increasing r beyond two often degrades the prediction performance (we elaborate on this point when presenting LPAs).
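Reusing the toy network from the previous sketch, Eq. 1 can be computed as follows:

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1, 0, 0, 0], dtype=float)

P = W / W.sum(axis=1)[:, None]  # Markov transition matrix
P2 = P @ P                      # [P^2]_ij: probability of a length-2 walk i -> j
f = P @ y + P2 @ y              # Eq. 1: direct plus two-step neighbor evidence
print(f)  # gene 3, with no direct positive neighbor, now scores via its 2-step path
```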

In the context of the above representation, several existing direct and indirect neighbor-based methods define node scores as \( {f}_{i}={\displaystyle \sum _{j=1}^{n}{p}_{ij}{y}_{j}+{\displaystyle \sum _{j=1}^{n}{[{\widehat{P}}^{2}]}_{ij}{y}_{j}}}\), where \( {\widehat{P}}^{2}\) is a modified version of \( {P}^{2}\) with some entries set to zero (or otherwise modified). For example, in the BioPIXIE graph search, the second summation only includes the top k genes with the highest direct-neighbor scores; if node j is not among the top-scoring direct neighbors of the positive genes, then \( {[{\widehat{P}}^{2}]}_{ij}=0\).

In this section, we described a simple framework for assigning node scores based on the labels of other nodes in their local neighborhood. In the next section, we show that the local neighborhood approaches, as presented in Eq. 1, can be thought of as an approximation to LPAs.

Label Propagation Algorithms

Intuitively, LPAs can be thought of as an iterative scheme where positively labeled nodes propagate their labels to their neighbors and neighbors of neighbors and so on. At termination, the final score of a node is computed as the total amount of propagation it has received throughout this process. For example, Fig. 2 shows the solution to the LPA proposed by (33) when applied to a hypothetical network with 34 unlabeled and six positive nodes. As shown, the assigned scores are higher for nodes that have many short-length paths to the positives: nodes that are in the periphery of the cluster that contains the positives have lower scores than those in the center of the cluster.
Fig. 2.

Given six positive nodes (colored blue) on the left network, the network on the right shows the scores of all nodes assigned by LPA. The node colors on the right network depict the scores. A common approach to making predictions from the scores is to select the top k nodes according to their scores.

Several variants of LPA have been used in gene function prediction including the works of (16, 34, 35). Here, we review an LPA derived from the Gaussian fields algorithm as it allows us to analytically calculate the final node scores (reviewed in ref. 36) and it has been shown to produce state-of-the-art performance in gene function prediction in yeast and mouse (17, 23).

The basic assumption of LPA is that the score of node i at iteration (or time) r can be computed from a weighted combination of score of its neighbors at the previous iteration and its initial label yi. In its simplest form, we can state this principle as \( {f}_{i}^{(r)}=\lambda {\displaystyle \sum _{j=1}^{n}{w}_{ij}{f}_{j}^{(r-1)}+(1-\lambda ){y}_{i}}\), where \( \lambda \) is a constant such that \( 0<\lambda <1\) thus making \( {f}_{i}^{(r)}\) a convex combination of the scores of its neighbors and its initial label.

However, in the above formulation for node scores \( {f}_{i}^{(r)}\), nodes with high weighted degree that have positive neighbors will end up influencing the scores of many nodes in their local neighborhood; it is standard practice to correct for this effect by normalizing W. In particular, we can normalize W in two different ways: (1) dividing each row by its row sum, thus using the Markov transition matrix \( P={D}^{-1}W\) in place of W, which gives \( {f}_{i}^{(r)}=\lambda {\displaystyle \sum _{j=1}^{n}\frac{{w}_{ij}}{{d}_{i}}{f}_{j}^{(r-1)}}+(1-\lambda ){y}_{i}=\lambda {\displaystyle \sum _{j=1}^{n}{p}_{ij}{f}_{j}^{(r-1)}}+(1-\lambda ){y}_{i}\); or (2) performing a symmetric normalization \( \dot{W}={D}^{-1/2}W{D}^{-1/2}\), which gives \( {f}_{i}^{(r)}=\lambda {\displaystyle \sum _{j=1}^{n}\frac{{w}_{ij}}{\sqrt{{d}_{i}{d}_{j}}}{f}_{j}^{(r-1)}}+(1-\lambda ){y}_{i}\). These two choices of normalization result in slightly different node scores. We first pursue the asymmetric normalization, using the Markov transition matrix P, as it allows us to directly compare LPA to the local neighborhood method presented in Eq. 1.
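A minimal numpy sketch of these two normalizations (the helper name is our own):

```python
import numpy as np

def normalize(W, kind="asymmetric"):
    """Return P = D^{-1} W (asymmetric) or W_dot = D^{-1/2} W D^{-1/2} (symmetric);
    assumes no isolated nodes (all weighted degrees positive)."""
    d = W.sum(axis=1)
    if kind == "asymmetric":
        return W / d[:, None]
    s = 1.0 / np.sqrt(d)
    return W * s[:, None] * s[None, :]
```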

Using P instead of W, we can write the node scores at iteration r in matrix form as \( {\overrightarrow{f}}^{(r)}=\lambda P{\overrightarrow{f}}^{(r-1)}+(1-\lambda )\overrightarrow{y}\). A nice property of LPA is that we can analytically compute the final (steady-state) scores by taking the limit \( r\to \infty \): \( \overrightarrow{f}=\lambda P\overrightarrow{f}+(1-\lambda )\overrightarrow{y}\), where, for simplicity, we write \( {\overrightarrow{f}}^{(\infty )}\) as \( \overrightarrow{f}\). By rearranging this equality we can directly calculate the final score vector as:
$$ \overrightarrow{f}=(1-\lambda ){(I-\lambda P)}^{-1}\overrightarrow{y}.$$
(2)
Below we further simplify the solution to LPA using Taylor’s matrix expansion. In particular, we make use of the equality \( {(I-\lambda P)}^{-1}={\displaystyle \sum _{r=0}^{\infty }{(\lambda P)}^{r}}\). This equality holds in our case because \( 0<\lambda <1\), which makes the largest magnitude eigenvalue of λP less than 1, a condition needed for the expansion to converge (see Note 3). Thus, we can write the final score vector as:
$$ \overrightarrow{f}=(1-\lambda ){\displaystyle \sum _{r=0}^{\infty }{\lambda }^{r}{P}^{r}\overrightarrow{y}}.$$
(3)
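As a sanity check, the following minimal sketch (assuming numpy) verifies that the iterative propagation converges to the closed-form solution of Eq. 2:

```python
import numpy as np

def lpa_closed_form(P, y, lam=0.5):
    """Eq. 2: f = (1 - lam) * (I - lam * P)^{-1} y."""
    return (1 - lam) * np.linalg.solve(np.eye(len(y)) - lam * P, y)

def lpa_iterative(P, y, lam=0.5, n_iter=100):
    """f^(r) = lam * P f^(r-1) + (1 - lam) * y, iterated to convergence."""
    f = np.array(y, dtype=float)
    for _ in range(n_iter):
        f = lam * (P @ f) + (1 - lam) * y
    return f

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = W / W.sum(axis=1)[:, None]
y = np.array([1, 0, 0, 0], dtype=float)
assert np.allclose(lpa_closed_form(P, y), lpa_iterative(P, y))
```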

The above representation clarifies the connection between LPA and the local neighborhood method of Eq. 1. However, there is a major difference: in LPA, because \( {P}^{r}\) is multiplied by \( {\lambda }^{r}\), which decays geometrically with r, propagation from the positive nodes declines rapidly with paths of increasing length, whereas there is no such guarantee when using Eq. 1. This fact offers an explanation for the rapid decline in performance of neighborhood methods when the path length increases beyond two (e.g., as observed in (11)).

Similar to the derivation above, the solution to LPA with symmetrically normalized W can be written as:
$$ \overrightarrow{f}=(1-\lambda ){(I-\lambda \dot{W})}^{-1}\overrightarrow{y}.$$
(4)

Again, using Taylor’s theorem, we can correspondingly derive the form \( \overrightarrow{f}=(1-\lambda ){\displaystyle \sum _{r=0}^{\infty }{\lambda }^{r}{\dot{W}}^{r}\overrightarrow{y}}\), since the largest magnitude eigenvalue of \( \lambda \dot{W}\) is again less than 1 (37).

The solution to LPA depends on a parameter \( 0<\lambda <1\). Note that larger \( \lambda \) allows more influence from paths of increasing length; in contrast, with small \( \lambda \), only local paths (e.g., those with path lengths of 2 or 3) influence the final solution. In practice, \( \lambda \) can be set using cross-validation (see Note 4).

In this section, we presented a formulation of LPA that is based on iterative propagation of scores to direct neighbors; it is also possible to derive the solution to LPA by solving a convex optimization problem (33), as we describe in Note 5. In particular, we derived LPA for two different normalizations of the matrix W: the asymmetric matrix P and the symmetrically normalized matrix \( \dot{W}\) (see Note 6). This formulation allows us to generalize several other LPAs: for example, the RankProp algorithm (34) uses the asymmetrically normalized P (as in Eq. 3). In contrast, the FunctionalFlow algorithm (16) does not explicitly set a decay parameter \( \lambda \) or downweight the influence of hubs via normalization; these criteria are implicitly enforced by always propagating to shortest-distance neighbors first and subtracting out-flow from in-flow. We have also shown that approximating the LPA solution using Taylor’s matrix expansion results in an algorithm very similar to the local neighborhood method.

In addition to LPAs, discrete Markov random fields (DMRFs) are another class of methods that consider the global network topology when making predictions. In particular, DMRFs can be viewed as a discrete version of LPA in which the predicted scores are constrained to be binary, i.e., \( \overrightarrow{f}\in {\{0,1\}}^{n}\). This integer constraint makes DMRFs intractable in general, and solving for the node scores typically requires considerable computational effort; previous studies have nevertheless used DMRFs for predicting gene function, either applying simulated annealing (38) or approximating the solution using coordinate descent (39). We note that several studies have found that methods based on discrete MRFs do not perform any better than LPAs or neighborhood-based methods (16, 19).

2.3 Constructing a Composite Network from Multiple Data Sources

As we discussed in Subheading 1, previous studies have shown that combining multiple high-throughput data sources into a single functional linkage network (FLN) results in better prediction performance. There are two broad categories of methods for constructing composite FLNs: (1) probabilistic FLNs, where the edge between two genes represents the probability of their cofunctionality (functional coupling), and (2) FLNs constructed by weighted summation of the underlying networks, each constructed from a different data source. Probabilistic FLNs are commonly constructed using Bayesian networks; in fact, most existing methods are similar to naïve Bayes (13, 40, 41, 42, 43). In the following, we describe the second approach for constructing FLNs. For a more extensive review of the subject, we refer the reader to (44).

A simple and widely used approach for combining multiple networks constructs a composite network, denoted by a weighted matrix W*, as the average of the individual networks: \( {W}^{*}=\frac{1}{D}{\displaystyle \sum _{d=1}^{D}{W}_{d}}\), where the D networks are represented as \( {W}_{d}\) (24). We can extend this approach by taking a weighted sum, \( {W}^{*}={\displaystyle \sum _{d=1}^{D}{\alpha }_{d}{W}_{d}}\), where each network \( {W}_{d}\) is weighted by a coefficient \( {\alpha }_{d}\) (17, 18, 21, 22). For example, we can set these coefficients to downweight redundant networks and ignore irrelevant ones. The coefficients \( {\alpha }_{d}\) are often set to optimize the performance of the composite network in predicting a single gene function or a group of gene functions. As an example, (17) used linear regression to determine the coefficients \( \overrightarrow{\alpha }\) (see Note 7).
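A minimal sketch of this weighted combination, assuming numpy; with no coefficients supplied it reduces to simple averaging:

```python
import numpy as np

def combine_networks(W_list, alphas=None):
    """Composite network W* = sum_d alpha_d * W_d (simple average by default)."""
    if alphas is None:
        alphas = np.full(len(W_list), 1.0 / len(W_list))
    return sum(a * W for a, W in zip(alphas, W_list))
```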

To summarize, combining multiple networks into a single composite network typically improves label propagation performance. In this section, we briefly described two simple and widely used methods for combining multiple networks. In the next section, we list online resources for prioritizing genes and predicting gene function from multiple heterogeneous data sources.

2.4 Online Resources for Predicting Gene Function

Table 2 summarizes several online resources that allow for on-demand gene prioritization and prediction of gene function: given a set of query (positive) genes, these resources use internally collected FLNs to predict other genes that are likely to share the same function or phenotype. Generally, there are three steps to making predictions: (1) collection and construction of multiple FLNs (offline), (2) integration of multiple FLNs into a single combined network (offline or online), and (3) predicting genes that are similar to the query set of genes (online).
Table 2

Summary of Web-based resources for on-demand prediction of gene function from FLNs

| Resource | Organisms | Algorithm | Flexibility in choosing FLNs |
| --- | --- | --- | --- |
| GeneMANIA (49) | Yeast, mouse, human, fly, worm, Arabidopsis thaliana | Network integration: choice of (a) linear regression, (b) averaging, or (c) averaging while accounting for redundancy. Label prediction: LPA using the combined network | Allows users to choose any combination of the available FLNs (by type or based on a publication) and the network integration method; users can also upload their own networks |
| FunCoup (50) | Yeast, mouse, human, fly, worm, rat, Arabidopsis thaliana, Ciona intestinalis | Network integration: Naïve Bayes. Label prediction: Naïve Bayes classifiers, which make predictions based on direct neighbors in the combined network | Allows users to choose a network type (for example, coexpression), but not specific networks from a given publication |
| STRING (51) | 630 organisms | Network integration: individual scoring of FLNs prior to combination using a noisy-OR model. Label prediction: considers direct neighbors only | All predictions are made from a fixed network |
| BioPIXIE (13) | Yeast | Network integration: Naïve Bayes. Label prediction: direct and second-order neighbor heuristic (BioPIXIE graph search) | All predictions are made from a fixed network |
| MouseNet (52) | Mouse | Network integration: Naïve Bayes. Label prediction: direct and second-order neighbor heuristic (BioPIXIE graph search) | All predictions are made from a fixed network |
| HEFalMp (43) | Human | Network integration: regularized Naïve Bayes. Label prediction: direct and second-order neighbor heuristic | All predictions are made from a fixed network |

3 Conclusions

In this chapter, we described how to construct networks from high-throughput data sources and how to combine multiple networks into a composite functional linkage network. In addition, we provided a review of algorithms for predicting node labels from networks; such algorithms are often applied to predict genes that are involved in a given function or result in a specific phenotype. In particular, we focused on describing label propagation algorithms and their relation to simpler neighborhood-based methods. As we showed, these two types of algorithms are closely related. This observation may explain why several studies have found that neighborhood methods perform similarly to LPAs (16, 45): molecular interaction networks tend to be small-world, so most pairs of nodes are connected by a small number of steps (e.g., 3 or 4), and short paths dominate the scores of both kinds of methods. Finally, we summarized several online resources for gene prioritization and for predicting gene function.

4 Notes

  1.

    Positive Labels for Gene Function Prediction. As mentioned in the Introduction, the set of positive labels can be derived from online databases such as Gene Ontology (GO) (6), MIPS (8), and KEGG (7). GO is one of the most widely used annotation databases, covering a large number of organisms. GO defines three hierarchies for describing properties of gene products: Biological Process, Molecular Function, and Cellular Component. The categories defined in GO range from very broad properties that encompass hundreds of genes (e.g., biological regulation) to very specific properties that include only a few genes (e.g., positive regulation of mitosis). Algorithms for predicting gene function are often tested on categories that have between 10 and 300 annotations; these categories have enough positives for training without being too broad (23). For a discussion on choosing informative GO categories, see ref. 46.

     
  2.

    Sparsification of FLNs. To construct an FLN from profile-based data, we can use a similarity metric such as the PCC; for example, in a coexpression network, the edge between genes i and j is the PCC between their expression profiles. However, since the PCC is often nonzero for many pairs of genes, we can often improve performance (both accuracy and computation time) by sparsifying the PCC-derived networks. A common way to do this is to keep the top m interacting partners for each gene and set the rest to zero; m can range from 50 to 100, as done in (17, 21, 23). A second approach is to set a threshold value t, where all PCCs smaller than t are set to zero (25).
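A minimal sketch of the top-m sparsification, assuming numpy; re-symmetrizing with an elementwise maximum is our choice, as the text does not specify how to restore symmetry:

```python
import numpy as np

def sparsify_top_m(W, m=50):
    """Keep, for each gene, only its m strongest partners, then
    re-symmetrize (an edge survives if either endpoint kept it)."""
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        top = np.argsort(W[i])[-m:]   # indices of the m largest weights in row i
        S[i, top] = W[i, top]
    return np.maximum(S, S.T)
```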

     
  3.

    Taylor’s Matrix Expansion Theorem. To use the matrix expansion theorem, we need to ensure that the largest magnitude eigenvalue of λP is less than 1: \( \underset{i}{\mathrm{max}}\left|{\sigma }_{i}\right|<1\). We use the Perron-Frobenius theorem (PFT) to show that this condition holds. From the PFT, the largest magnitude eigenvalue is bounded by the minimum and maximum row sums: \( \underset{i}{\mathrm{min}}{\displaystyle \sum _{j=1}^{n}\lambda {p}_{ij}}\le \underset{i}{\mathrm{max}}\left|{\sigma }_{i}\right|\le \underset{i}{\mathrm{max}}{\displaystyle \sum _{j=1}^{n}\lambda {p}_{ij}}\). Since \( {\displaystyle \sum _{j=1}^{n}{p}_{ij}=1}\) for all i, every row sum of λP equals \( \lambda \), and because \( 0<\lambda <1\), the largest magnitude eigenvalue is less than 1.

     
  4.

    Setting the Parameter \( \lambda \) in LPA. We can set the parameter \( \lambda \) using cross-validation. To do so, we investigate the performance of various settings of \( \lambda \) on a validation set (a fraction of the training data not used in training). In practice, we have found that setting \( \lambda \approx 0.5\) (when solving LPA as in Eq. 4) results in good performance on a wide variety of prediction tasks.
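A minimal cross-validation sketch, assuming numpy; scoring λ by the mean rank of held-out positives is one reasonable criterion among several (AUC or precision could be used instead):

```python
import numpy as np

def lpa(P, y, lam):
    return (1 - lam) * np.linalg.solve(np.eye(len(y)) - lam * P, y)

def select_lambda(P, y, lambdas=(0.1, 0.3, 0.5, 0.7, 0.9), n_folds=3, seed=0):
    """Hold out a fold of positives, propagate from the rest, and keep the
    lambda under which the held-out positives rank highest on average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(np.flatnonzero(y)), n_folds)
    results = []
    for lam in lambdas:
        mean_rank = 0.0
        for held_out in folds:
            y_tr = np.array(y, dtype=float)
            y_tr[held_out] = 0.0                               # hide this fold
            ranks = np.argsort(np.argsort(-lpa(P, y_tr, lam)))  # 0 = top gene
            mean_rank += ranks[held_out].mean() / n_folds
        results.append((mean_rank, lam))
    return min(results)[1]  # lowest mean rank of hidden positives wins
```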

     
  5.
    LPA as the solution to a convex optimization problem. LPA as proposed by (33, 47) is derived using the following objective function:
    $$ \mathrm{arg}\underset{\overrightarrow{f}}{\mathrm{min}}c{\displaystyle \sum _{i=1}^{n}{({f}_{i}-{y}_{i})}^{2}+\frac{1}{2}{\displaystyle \sum _{i,j=1}^{n}{w}_{ij}{\left(\frac{{f}_{i}}{\sqrt{{d}_{i}}}-\frac{{f}_{j}}{\sqrt{{d}_{j}}}\right)}^{2}}},$$
    (5)
    where c > 0 is a constant. This equation can be written in matrix form as \( \mathrm{arg}\underset{\overrightarrow{f}}{\mathrm{min}}\;c{(\overrightarrow{f}-\overrightarrow{y})}^{T}(\overrightarrow{f}-\overrightarrow{y})+{\overrightarrow{f}}^{T}\dot{L}\overrightarrow{f}\), with \( \dot{L}=I-\dot{W}\) known as the normalized graph Laplacian (recall that \( \dot{W}={D}^{-1/2}W{D}^{-1/2}\) is the symmetrically normalized weight matrix). Differentiating and setting the derivative to zero, we get \( \overrightarrow{f}=c{\left((1+c)I-\dot{W}\right)}^{-1}\overrightarrow{y}\). Setting \( \lambda =\frac{1}{1+c}\), we can rewrite this as \( \overrightarrow{f}=(1-\lambda ){(I-\lambda \dot{W})}^{-1}\overrightarrow{y}\). Thus, this version of LPA uses the symmetrically normalized W that we used to derive Eq. 4.
     
  6.

    LPA with symmetrically and asymmetrically normalized weight matrices. Empirically, we have observed better performance when using the symmetrically normalized \( \dot{W}\), and so we suggest using this form of LPA in practice. Furthermore, because \( \dot{W}\) is symmetric, we can solve Eq. 4 efficiently using conjugate gradient, which scales with the number of nonzero elements in \( \dot{W}\).
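For example, a sparse conjugate-gradient solve of Eq. 4 might look like the following sketch (assuming scipy; the convergence guard is ours, and the code assumes no isolated nodes):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def lpa_symmetric_cg(W, y, lam=0.5):
    """Solve (I - lam * W_dot) f = (1 - lam) * y by conjugate gradient,
    where W_dot = D^{-1/2} W D^{-1/2}; the cost scales with nnz(W)."""
    W = sp.csr_matrix(W)
    d = np.asarray(W.sum(axis=1)).ravel()
    s = sp.diags(1.0 / np.sqrt(d))
    W_dot = s @ W @ s                          # symmetric normalization
    A = sp.identity(W.shape[0]) - lam * W_dot  # symmetric positive definite
    f, info = cg(A, (1 - lam) * y)
    assert info == 0, "conjugate gradient did not converge"
    return f
```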

     
  7.

    Constructing a Composite Network by Averaging FLNs. Several previous studies have found that simple averaging of the underlying networks often results in composite networks with performance comparable to those constructed with more sophisticated methods, for example, those that assign network weights \( {\alpha }_{d}\) (24, 48). However, simple averaging suffers when many of the underlying networks are redundant (i.e., represent similar information); for example, there are often many more coexpression networks than networks of other data types. One simple way to correct for this redundancy is to group networks together based on their type and assign each type an equal weight: e.g., \( {W}^{*}=\frac{1}{3}{W}_{\mathrm{exp}}+\frac{1}{3}{W}_{gi}+\frac{1}{3}{W}_{pi}\), where \( {W}_{\mathrm{exp}}\), \( {W}_{gi}\), and \( {W}_{pi}\) represent the average of all coexpression, genetic interaction, and protein interaction networks, respectively.
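A minimal sketch of this per-type averaging (the dictionary layout is our illustration):

```python
def average_by_type(networks_by_type):
    """networks_by_type maps a data type (e.g., 'coexpression') to a list of
    n x n numpy arrays; average within each type, then across types."""
    per_type = [sum(ws) / len(ws) for ws in networks_by_type.values()]
    return sum(per_type) / len(per_type)
```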

     

References

  1. Marcotte, E.M., et al., Detecting protein function and protein–protein interactions from genome sequences. Science, 1999. 285(5428): p. 751–3.
  2. Wu, X., et al., Network-based global inference of human disease genes. Mol Syst Biol, 2008. 4: p. 189.
  3. Aerts, S., et al., Gene prioritization through genomic data fusion. Nat Biotechnol, 2006. 24(5): p. 537–44.
  4. Sharan, R., I. Ulitsky, and R. Shamir, Network-based prediction of protein function. Mol Syst Biol, 2007. 3: p. 88.
  5. Oti, M. and H.G. Brunner, The modular nature of genetic diseases. Clin Genet, 2007. 71(1): p. 1–11.
  6. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25–9.
  7. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 1999. 27(1): p. 29–34.
  8. Ruepp, A., et al., The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res, 2004. 32(18): p. 5539–45.
  9. Robinson, P.N., et al., The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 2008. 83(5): p. 610–5.
  10. Hamosh, A., et al., Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, 2005. 33(Database issue): p. D514–7.
  11. Chua, H.N., W.K. Sung, and L. Wong, Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions. Bioinformatics, 2006. 22(13): p. 1623–30.
  12. Zhou, X., M.C. Kao, and W.H. Wong, Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci USA, 2002. 99(20): p. 12783–8.
  13. Myers, C.L., et al., Discovery of biological networks from diverse functional genomic data. Genome Biol, 2005. 6(13): p. R114.
  14. Karaoz, E., et al., Protective role of melatonin and a combination of vitamin C and vitamin E on lung toxicity induced by chlorpyrifos-ethyl in rats. Exp Toxicol Pathol, 2002. 54(2): p. 97–108.
  15. Deng, M., et al., Prediction of protein function using protein–protein interaction data. J Comput Biol, 2003. 10(6): p. 947–60.
  16. Nabieva, E., et al., Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 2005. 21 Suppl 1: p. i302–10.
  17. Mostafavi, S., et al., GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol, 2008. 9 Suppl 1: p. S4.
  18. Tsuda, K., H. Shin, and B. Scholkopf, Fast protein classification with multiple networks. Bioinformatics, 2005. 21 Suppl 2: p. ii59–65.
  19. Murali, T.M., C.J. Wu, and S. Kasif, The art of gene function prediction. Nat Biotechnol, 2006. 24(12): p. 1474–5; author reply 1475–6.
  20. Deng, M., T. Chen, and F. Sun, An integrated probabilistic model for functional prediction of proteins. J Comput Biol, 2004. 11(2–3): p. 463–75.
  21. Mostafavi, S. and Q. Morris, Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 2010.
  22. Lanckriet, G.R., et al., A statistical framework for genomic data fusion. Bioinformatics, 2004. 20(16): p. 2626–35.
  23. Pena-Castillo, L., et al., A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol, 2008. 9 Suppl 1: p. S2.
  24. Pavlidis, P., et al., Learning gene functional classifications from multiple data types. J Comput Biol, 2002. 9(2): p. 401–11.
  25. Zhang, B. and S. Horvath, A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 2005. 4: Article 17.
  26. Yona, G., et al., Effective similarity measures for expression profiles. Bioinformatics, 2006. 22(13): p. 1616–22.
  27. Warde-Farley, D., et al., The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res, 2010. 38 Suppl: p. W214–20.
  28. Costanzo, M., et al., The genetic landscape of a cell. Science, 2010. 327(5964): p. 425–31.
  29. Tong, A.H., et al., Global mapping of the yeast genetic interaction network. Science, 2004. 303(5659): p. 808–13.
  30. Weirauch, M.T., et al., Information-based methods for predicting gene function from systematic gene knock-downs. BMC Bioinformatics, 2008. 9: p. 463.
  31. Hishigaki, H., et al., Assessment of prediction accuracy of protein function from protein–protein interaction data. Yeast, 2001. 18(6): p. 523–31.
  32. Schwikowski, B., P. Uetz, and S. Fields, A network of protein–protein interactions in yeast. Nat Biotechnol, 2000. 18(12): p. 1257–61.
  33. Zhou, D., et al., Learning with local and global consistency, in Neural Information Processing Systems. 2003, MIT Press: Vancouver, BC, Canada.
  34. Weston, J., et al., Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci USA, 2004. 101(17): p. 6559–63.
  35. Hu, P., H. Jiang, and A. Emili, Predicting protein functions by relaxation labelling protein interaction network. BMC Bioinformatics, 2010. 11 Suppl 1: p. S64.
  36. Bengio, Y., O. Delalleau, and N. Le Roux, Label propagation and quadratic criterion, in Semi-Supervised Learning, O. Chapelle, B. Scholkopf, and A. Zien, Editors. 2006, MIT Press.
  37. Chung, F., Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. 1999: American Mathematical Society.
  38. Vazquez, A., et al., Global protein function prediction from protein–protein interaction networks. Nat Biotechnol, 2003. 21(6): p. 697–700.
  39. Karaoz, U., et al., Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci USA, 2004. 101(9): p. 2888–93.
  40. Fraser, A.G. and E.M. Marcotte, A probabilistic view of gene function. Nat Genet, 2004. 36(6): p. 559–64.
  41. Lee, I., et al., A probabilistic functional network of yeast genes. Science, 2004. 306(5701): p. 1555–8.
  42. Myers, C.L. and O.G. Troyanskaya, Context-sensitive data integration and prediction of biological networks. Bioinformatics, 2007. 23(17): p. 2322–30.
  43. Huttenhower, C., et al., Exploring the human genome with functional maps. Genome Res, 2009. 19(6): p. 1093–106.
  44. Noble, W.S. and A. Ben-Hur, Integrating information for protein function prediction, in Bioinformatics - From Genomes to Therapies, T. Lengauer, Editor. 2007, Wiley-VCH Verlag GmbH & Co KGaA: Weinheim, Germany.
  45. Song, J. and M. Singh, How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics, 2009. 25(23): p. 3143–50.
  46. Myers, C.L., et al., Finding function: evaluation methods for functional genomic data. BMC Genomics, 2006. 7: p. 187.
  47. Zhu, X., J. Lafferty, and Z. Ghahramani, Semi-supervised learning using Gaussian fields and harmonic functions, in International Conference on Machine Learning. 2003: Washington, DC, USA.
  48. Lewis, D.P., T. Jebara, and W.S. Noble, Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics, 2006. 22(22): p. 2753–60.
  49. Warde-Farley, D., et al., The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res, 2010. 38 Suppl: p. W214–20.
  50. Alexeyenko, A. and E.L. Sonnhammer, Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res, 2009. 19(6): p. 1107–16.
  51. von Mering, C., et al., STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res, 2005. 33(Database issue): p. D433–7.
  52. Guan, Y., et al., A genomewide functional network for the laboratory mouse. PLoS Comput Biol, 2008. 4(9): p. e1000165.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Sara Mostafavi (1)
  • Anna Goldenberg (2)
  • Quaid Morris (3, 4)

  1. Department of Computer Science, Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, Canada
  2. Banting and Best Department of Medical Research, Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
  3. Department of Computer Science, Banting and Best Department of Medical Research, Centre for Cellular and Biomolecular Research, Toronto, Canada
  4. Department of Molecular Genetics, University of Toronto, Toronto, Canada
