Predicting links between tumor samples and genes using 2Layered graph based diffusion approach
Abstract
Background
Determining the association between tumor sample and the gene is demanding because it requires a high cost for conducting genetic experiments. Thus, the discovered association between tumor sample and gene further requires clinical verification and validation. This entire mechanism is timeconsuming and expensive. Due to this issue, predicting the association between tumor samples and genes remain a challenge in biomedicine.
Results
Here we present, a computational model based on a heat diffusion algorithm which can predict the association between tumor samples and genes. We proposed a 2layered graph. In the first layer, we constructed a graph of tumor samples and genes where these two types of nodes are connected by “hasGene” relationship. In the second layer, the gene nodes are connected by “interaction” relationship. We applied the heat diffusion algorithms in nine different variants of genetic interaction networks extracted from STRING and BioGRID database. The heat diffusion algorithm predicted the links between tumor samples and genes with mean AUCROC score of 0.84. This score is obtained by using weighted genetic interactions of fusion or cooccurrence channels from the STRING database. For the unweighted genetic interaction from the BioGRID database, the algorithms predict the links with an AUCROC score of 0.74.
Conclusions
We demonstrate that the genegene interaction scores could improve the predictive power of the heat diffusion model to predict the links between tumor samples and genes. We showed the efficient runtime of the heat diffusion algorithm in various genetic interaction network. We statistically validated our prediction quality of the links between tumor samples and genes.
Keywords
Graph Heat Diffusion Prediction Tumor Genes InteractionBackground
Traditionally, the linkage and mapping between genes and diseases are well researched using genomescalesequencebased associations studies [1]. The studies of the disease causal genes are essential, but because of their timeconsuming approach to experimental validation, these methods are considered expensive. The sequencing of the genes and associations studies performs better for carefully selected functional gene candidates. This process is complicated and demands specialized knowledge [2]. Phenotypically similar diseases are often caused by functionally related genes [3], which strongly motivates the use of graph analytics and network science to study the functional relevance of the genes.
The genetic variation accounts for a proportion of susceptibility to common diseases such as diabetes, cardiovascular disease, and tumor. In the context of the tumor, DNA methylation is one of the early diagnostic markers of cancer. The differential DNA methylation status of each gene in each tumor can be verified and successfully accomplished using gene expression data in the laboratory settings. However, identifying the methylated DNA genes can only be obtained through laborious and tedious experiments. Consequently, DNA methylationbased (DNAm) studies in tumors are complicated further by disease heterogeneity [4].
In this study, we have used DNA methylation data. The DNA methylation is most intensely studied in epigenetic modifications in mammals and has important roles in studying tumorigenesis [5, 6]. A causal relationship between inflammation and cancer has long been accepted in multiple tumor types, supported by the evidence that the methylated gene observed in early dysplastic lesion [7]. The DNA methylation represents an early and crucial step in the generegulation pathway by which normal tissue experiences neoplastic transformation for the development of cancer. Further, a judgment of the methylation profiles within neoplastic tissues help the diagnosis of the disease, predicting the clinical behavior, and designing specific treatment plans for individual patients.
Every tumor samples are extracted from the individual patients. Thus conducting biological experiment to identify methylated cancer genes for each tumor samples is expensive and cumbersome. Therefore, computational methods can be a complementary approach [8, 9] because such models are faster and cheaper to perform than biological assessment.
There are large number of studies on tumortype identification in a laboratory setting. Tumor tissues(biopsy) remain the main method for analyzing cancer in most cases. When signs and symptoms indicate the probability of tumor growth in a patient, a biopsy is performed to extract tissue samples from the patient by a pathologist. If the cells in the tumor tissues are identified as malignant, then genomic analysis of tumor DNA allows for the personalized treatments for cancer [10]. As cancers are genetic diseases [11] it is the outcome from the mutations in the cells. If the tumor is malignant then cells do not reproduce sexually [12] so, geneticist finds it even harder to track the links between tumors and genes by using classical inheritance method.
From the computational point of view, a number of graphbased approaches have been explored [13, 14, 15] to prioritize diseasegene associations. The major assumption of most of the graphbased approach is that gene causing the corresponding disease links to each other in a molecular network [16, 17]. The identification of new links or edges, i.e. new genetic interactions, is the most prominent task in biological network analysis. Most of the studies for genedisease predictions make use of homogeneous networks, where all entities have the same type as well as the edges. In homogeneous networks, any shared neighbor between entities is considered the right approach for link predictions. However, in heterogeneous networks, the neighbors of an entity (or node) could have different types, and the number of shared neighbors can be a faulty parameter for predictions since it does not fully cover the graph’s heterogeneity. As a result, a different approach is required to predict novel links in a multilayered network. Supervised machine learning has been used to predict genedisease associations in heterogeneous networks. A popular approach makes use of the different relationship types in the multirelational graph as a training feature for link prediction [18, 19]. In other genedisease association studies from [20, 21, 22], the relations from the heterogeneous genedisease graph have been exploited as features to prioritize the genes. One specific challenge in supervised link prediction is the need for training data, which has to be labeled for the two different types such as link and no link which is a timeconsuming process.
Despite so much interesting research related to link prediction, all the above work avoid the impact of information diffusion mechanism on the link prediction. In the context of biological networks, the diffusionbased approach is considered important to identify genes and underlying diseases [23]. Classically, diffusion is studied mostly in a homogeneous networks [24, 25] where the information is diffused in single channels. Whereas, in the case of disease it propagates from different types of objects for e.g. disease propagate among genes and genes interact with each other. This process can be modeled as a 2layered graph. In the context of our work, the first layer is tumor sample and gene association and second is the genegene interaction layer.
Related work
In the context of general link prediction, matrix factorization is widely used. In matrix factorization, the networks are represented as matrix and entries are represented as the relationships. Menon et al.[26] claimed that link prediction is the problem of matrix completion. The low rank matrix decomposition based on Singular Value Decomposition (SVD) [27, 28] are used to predict links. For the multirelational link prediction, tensorbased factorization is prominently used. The strength of tensors is that the multirelational graph can be expressed in higherorder tensors which can be easily factorized. Unlike graphical models such as Markov Logic Networks (MLN) or Bayesian Networks, these models do not require a priori knowledge that needs to be inferred from data [29]. The matrix or tensor based factorization has three shortcomings. First, these methods do not account the structural property exhibited by networks such as high sparsity and skewed degree distribution [30]. Second, matrix or tensors based factorization methods requires the latent features or components to predict the links. It is difficult to estimate the number of latent features in advance that can give the best predictions. Third, if computational cost is an issue then matrix or tensor factorization can be very expensive and timeconsuming [31, 32].
With the boom in neural network embedding, the problems of matrix factorization are overcome. Network embedding is the method to learn a lowdimensional representation of nodes in the network preserving network structure [33]. In recent studies, a node2vec [34] approach can analyze different network neighborhoods to embed nodes based on the assumption of homophily (i.e. network communities) as well as structural equivalence (i.e. structural roles of nodes) for link prediction in a homogeneous network for same the edge type. The study by Zitnik et al.[30] extended the node2vec algorithm in a multilayered network called OhmNet, where each layer represents molecular interactions in different human tissues and reported accurate predictions for tissue cellular functions.
In recent years, the node embeddings techniques [30, 35] seems prominent because these methods have demonstrated high accuracy but it has also some limitations. These methods actually require learning steps which might be unfeasible for largescale networks with millions of nodes [36]. Similaritybased propagation methods are also well studied in predicting the links in bipartite networks. This is the classic network based propagation in recommender system that predicts most relevant objects for users [37, 38].
In the biological context, diffusionbased approach for predicting disease and genes are well studied. Network propagation has become a demanding technique in computational systems biology with the focus on protein function prediction, disease gene prioritization, and patient stratification [39]. Similarly, network propagation approach are used to study the clusterwide variety of cancer types [40, 41].
The study by Cowen et al.[23] reported that networkbased propagation is a powerful data transformation method of broad utility in genetic research. Network propagation magnifies a biological signal based on the belief that genes underlying similar phenotypes influence to interact with one another [42]. The work by Ruffalo et al.[43] suggested that using network propagation can predict the cancer driver genes which tend to cluster in the network. There are different variants of network propagation proposed such as random walks [44], PageRank [45] and heat diffusion [46] algorithms. These methods are successfully applied to study biological problem. Among them heat diffusion algorithm showed potential in prioritizing the disease gene association [47, 48] and performs the best among all networkbased diffusion approaches. Similarly, the HotNet algorithm [49] based on heat diffusion algorithm shows promising results to identify mutated genes. From the computational perspective, heat diffusion is fast to compute [39, 50, 51] and robust in memory usage [52].
The heat diffusion discussed above is different in two important ways. First, those heat diffusionbased approach is applied in a homogeneous network, meaning where nodes and the edges are of the same type. Second, we use 2 layered networks; the first is used for selecting seed nodes to carry tumor information and the second for diffusing the carried information in a genetic interaction network to predict the association between tumor samples and genes. Our approach is similar to DawnRank[53]. However, there are three key differences which are as follows. (i) Choice of the ranking algorithm which is heat diffusion instead of PageRank, (ii) Using information from multiple patients tumor sample data, (iii) Using methylated genes instead of the gene expression data.
Methods
Graph data model
We used 2 layered graph [54] to model our graph. The main motivation for using 2 layered graph is that using a single layer of Tumor and Gene graph information is not sufficient to model the complex process like tumor samples and gene link prediction. Tumors which are genetic disorders, it is important not to ignore the information about genetic interaction. Due to this, we need a separate layer to model genetic information. When gene interaction graphs are available, it is natural to incorporate them as additional information sources. This will improve the prediction accuracy of the model. The previous studies [55, 56] also showed that by the integration of multiple graphs can achieve higher accuracy than any single graph alone.
Construction of network layer 1
The methylation level reported in the data is based on the beta value. The betavalue is the estimate of methylation level using the ratio of intensities between methylated and unmethylated genes. The genome location is based on the CpG targeted by the probe in the coding region. Our bipartite network consists of two disjoint sets of nodes: one set corresponds to the tumor samples; the other set corresponds to all the methylated genes in each tumor samples. The edge between the tumor sample and gene is based on the fact reported in the data. For instance, “TCGAB6A0RG01” is a tumor sample and “HOXC4” is a methylated gene reported in the data then we link this relation by “hasGene” edge which is \(\left [\text {TCGAB6A0RG01}\ \underrightarrow {hasGene}\ \text {HOXC4}\right ]\).
Construction of network layer 2
Network layer 2 is the interaction graph between genes as shown in Fig. 1. For the construction of network layer 2, we used 9 different variants of protein interaction channels. One from BioGRID and eight from STRING proteinprotein interaction databases. These 2 databases are publicly available.
The prior studies by [47, 58, 59] modeled gene interaction graph as the undirected graph. We took the same approach and modeled our gene interaction graph as the undirected graph.
We used the proteinprotein interaction links with weights for the homosapiens class from latest STRING version 10.5 database. There are eight different weighted channels of the proteinprotein interaction networks available in STRING which are as follows: coexpression, cooccurrence, database, experimental, fusion, neighborhood, textmining and combined.
From BioGRID Api^{3} we constructed GeneGene physical network. BioGRID database provides the protein interactions curated from the biomedical literature [60] and has provided well validated physical interactions. The previous studies by [61, 62, 63, 64] has shown that the potential of prioritizing the genes based on the physical properties. This network is the unweighted network.
To transform the original protein interaction network into a gene interaction network for both STRING and BioGRID data, we took the following approaches: (i) Protein names were mapped to their encoding genes by parsing of EnsEMBL files [65]. (ii) In the case of genes encoding multiple proteins, we took the edge of maximum (integrated) weight connecting any pair of proteins encoded by such genes. Similar technique for protein to gene mapping has also been used by the prior studies [48].
Network summary of 2layered tumorgene graph
Property  Value 

Number of tumor samples  4086 
Number of genes  4071 
Number of relations between tumor samples and genes (hasGene)  222252 
Cooccurrence  1166 
Coexpression  208470 
Database  23169 
Experimental  170642 
Fusion  98 
Neighborhood  18929 
Text mining  322883 
Combined  358627 
Physical  18395 
Solution approach
Our second step is to apply the heat diffusion algorithm. For the execution of the heat diffusion algorithm, we need seed nodes. The seed nodes carry information about tumor samples. This information is available from the network layer 1 and diffused to genegene interaction layer. Once the diffusion process is over, then we get the association score between every tumor samples and genes. These association scores provide tumor samples and genes prediction.
Model description
Heat diffusion model
Heat diffusion is the usual physical phenomenon. In a medium, heat always flows from a high temperature to a low temperature. The heat diffusionbased approaches have been successfully applied in various domains such as web spamming in web graph analysis [66], recommender system [67, 68] and diseasegene prioritization [47]. To make it selfcontained, we will briefly introduce the heat diffusion model for weighted and unweighted undirected graphs which is adapted from Yang et al.[66].
Heat flow on an undirected and unweighted genegene interaction graph
We inferred the unweighted graph as the graph which has no edge weights. In the case of undirected and unweighted graph, the edge (g_{i},g_{j}) is considered as a pipe from where the heat flows and connects gene nodes g_{i} and g_{j}.

The heat obtained M(i,j,t,Δt) is proportional to the time period Δt.

The heat obtained M(i,j,t,Δt) is proportional to the heat difference f_{j}(t)−f_{i}(t).

d(g_{j}) is the degree of the gene node g_{j}.

f(0) is the initial heat vectors of the genes.

f(1) is the final heat vectors of the genes.
Heat flow on an undirected and weighted geneGene interaction graph

RH is proportional to the time period Δt.

RH should be proportional to the weight w_{ji} of the undirected edge (g_{j},g_{i}).

RH should be proportional to the heat at node g_{j}.

RH is zero if there is no link between g_{j} to g_{i}. As a result, g_{i} will receive \(\sum _{j:(g_{j},g_{i}) \in E} \sigma _{j}w_{ji}f_{j}(t)\Delta t\) amount of heat from its neighbors that are connected to it.

\(\sigma _{j} = \frac {\alpha }{d({g_{j}})}\) where d(g_{j}) is the out degree of the gene node g_{j} and α is the thermal conductivity.

The heat DH(i,t, Δ t) should be proportional to the time period Δt.

The heat DH(i,t, Δ t) should be proportional to the heat at node g_{i}.

Each node has the same ability to diffuse the heat.

The heat DH(i,t, Δ t) should be distributed to its neighboring nodes proportional to the weight on each edges.

τ is the flag to check whether the node has any outgoing links. If there is any outgoing links then τ=1 else τ=0
The matrix \(e^{\alpha H^{*}}\) is called as diffusion kernel which means the heat diffusion process continues infinitely many times from the initial heat diffusion. The parameter α is called as thermal conductivity. Higher the value of α faster is the spread of heat in the network. If α is infinitely large then heat diffuse from one node to another quickly.
In the context of genegene interaction, studies by [69, 70, 71] shown that the genes can randomly interact between adjacent genomic fragments. Thus, in order to capture that behavior, we add uniform random relations among different genes. Let γ denotes the probability of not forming random interactions and (1 γ) is the probability of taking a “random jump”. This behavior is also called “teleport” operation in the computation of PageRank [45] in web graph. The real world application considers the random edges [66] so, we followed the same setting of γ = 0.85 as in PageRank in all of our experiment.
Where H can be replaced either H^{∗} or H^{∗∗} depending upon the kind of graph used.
Computational complexity
where M is the positive integer and I is the identity matrix. In order to reduce the computational complexity, we apply three methods: (1) Since f(0) is a vector, we iteratively calculate \((I + \frac {\alpha }{M}R)^{M} f(0)\) by applying the operator \((I + \frac {\alpha }{M}R)^{M} \) to f(0). (2) For matrix R, we apply a data structure which only stores information of nonzero entries, since it is a sparse matrix. (3) For every heat source which is tumor samples in our case, we bind it by diffusing heat to its neighbors. The selection of α and M parameters is detailed described in “Experiments” section. Specifically, after using the discrete formalization of the complexity of the heat diffusion algorithm in our model is given by O(MET), where M is the number of iterations, T is the number of tumor nodes and E is the number of edges in the genegene interaction graph.
In the next section, we focus on how we use the diffusion model to predict Tumor samples and Gene relationships.
Tumor gene predictions in a toy network
With the diffusion model described in the above section, we can now make the prediction by the following approach:
The initial values of the vector f(0) is given by: f(0)=[ 1,1,1,0]^{T}. We see in this vector the position of Gene D is 0 because there is no connection from Tumor X to Gene D.
Thus the computed f(1) vector is given by f(1)=[1.0,1.0,0.5,0.43]. Now normalizing the each vectors in f(1) by sum of all the numbers in f(1) then f(1) = [0.34,0.34,0.17, 0.14]. Here the interesting thing to observe is at the position of Gene D. This value was initially 0 after diffusion and normalization we saw the value to be 0.14. This value is the likelihood of Tumor X to form link with Gene D.
Experiments
From Eq. 6 we observed the two parameter α and M. We used the AUCROC evaluation metric which is commonly used in medical sciences and machine learning communities for quantifying the accuracy of prediction algorithm [72]. The brief description of this metric is provided in the “Evaluation metrics” section. The parameters alpha (α) and iterations (M) are estimated from 10fold crossvalidation on the training sets and applied the learned parameter in the test sets. Thus, AUCROC reported in the test set is the average score across ten folds.
Impact of parameter α
Parameter α also known as thermal conductivity plays important role in heat diffusion process. If α is set to high then heat diffuses faster. Contrarily, heat diffuses slower. We varied α from the range 0 ≤α≤1. When α is set to 0, that means no diffusion and the temperature distribution will remain exactly at the initial values than the structure of the graph.
Impact of parameter M
Physical meaning of parameterα and m
The value for alpha and number of iteration M are computationally determined in the crossvalidation mode in the training sets. These 2 parameters are quite hard to determine in advance without accessing any data. It has to be noted that the optimal performance obtained by maximizing alpha and M in our studies is based in terms of link prediction accuracy results. The mechanistic meaning of alpha is that if it approaches infinity the diffusion reaches to equilibrium then all the connected nodes receive the same diffusion contribution which is similar as PageRank. If we set alpha as 0 then there is no diffusion. Due to this, we need to find the optimum alpha that can balance the extent of heat which diffuses from genes to its immediate neighbors and to the rest of the network.
Similarly, M which is the number of iteration, the impact of this parameter is that how far the heat diffuses from the seed genes. The physical meaning of the scalar parameter M is the total time of diffusion, which controls the amount of heat to which the initial signal is allowed to spread over the network. The probabilistic interpretation for this computations is that if the input values are preference binary vector which is in our case (1 means genes having an association with tumor samples and 0 means no association) of starting positions for heat diffusing across the edges of the genetic interaction graph, the final value is the position distribution after M iterations. If M tends to infinity then the probability distribution approaches to a uniform distribution over all the genes.
Runtime performance
Runtime comparison on various genegene interaction channels
Data  Total run time  

Iteration 2  Iteration 6  Iteration 30  
Cooccurrence  29.05s  63.37s  242.24s 
Experimental  27.01s  68s  267.60s 
Fusion  45.65s  74.41s  278.78s 
Neighborhood  25.62s  67.13s  260.46s 
Textmining  29.55s  75.20s  264.93s 
Coexpression  26.23s  65.97s  247.96s 
Physical  25.46s  65.56s  234.31s 
Combined  24s  61.15s  230s 
Database  25.57s  57.28s  233s 
From Fig. 6, we observed the impact of parameter M. At iteration = 6 the heat diffusion algorithm converged for Fusion channel in STRING data and for iteration = 5 for the physical channel in BioGRID data. For the rest of the channels, algorithm converged in less than 6 iterations. This means in maximum the algorithm takes 74.41 s to give us the prediction results for 4086 tumor samples and 4071 genes.
Degree biasedness
Impact of degree in genegene interaction channels
Data  Average node degree  AUCROC score without degree normalization  AUCROC score with degree normalization 

Cooccurrence  6.08  0.50  0.84 
Experimental  90.35  0.52  0.71 
Fusion  1.66  0.80  0.84 
Neighborhood  48.22  0.50  0.81 
Textmining  159.4  0.52  0.75 
Coexpresion  106.01  0.52  0.74 
Physical  9.03  0.50  0.68 
Combined  176.92  0.52  0.75 
Database  20.19  0.50  0.75 
From the results shown in the Table 3, we observed that after degree normalization the results are improved.
Examples of tumor gene prediction results
In total, we randomly selected 10 different tumor samples from our test set covering ten different tumor anatomical locations. Of the 8 interaction channels from STRING, we showed the result from fusion interaction channel because: (i) it gave us high mean AUCROC score compare to rest of channels (see “Results” section) (ii) as our study is related with tumor, gene fusions have been increasingly detected by nextgeneration sequencing (NGS) technologies based methods in malignant tumors [73,74].
We ranked the top 5 Genes predicted for each tumor sample based on their diffusion score.
Example of tumor samples and gene predictions (α = 1.0)
Tumor sample name  Tumor location  alpha = 1.0 STRING (Top 5 Predicted Genes)  alpha = 1.0 BioGRID (Top 5 predicted genes) 

TCGACVA6JN01  Upper aerodigestive tract  RRBP1 SPRY2 ZMAT4 HOXA2 IRS4  HOXA2 SPRY2 IRS4 ZMAT4 RRBP1 
TCGASXA71W01  Kidney  EPM2AIP1 C4orf26 TM4SF1 DDAH2 FBXO2  UBC APP PLVAP HNRNPA1 ALB 
TCGA38727101  Lung  HOXC4HOXA3HOXB3HOXB5 HOXD3  HOXC4 APP PRMT6 WHSC1L1 HOXB5 
TCGAHPA5MZ01  Liver  GNE GJC2 LPPR2 ACACB CLASP1  CLASP1 LPPR2 GNE GJC2 ALB 
TCGAD5654101  Large intestine  EDARADD AGRN ZMAT4 MAGI2 LAMA4  EDARADD MAGI2 LAMA4PCDHGC3ZMAT4 
TCGAQUA6IM01  Prostrate  SRI ZIC2 ZIC5 LPIN1 IGF2BP1  SRI ZIC2 NAGK POU5F1 TEAD2 
TCGAYCA8S601  Urinary tract  PRELID1 SRM LSM4 POLR2E CCT7  PRELID1 CRYAB TNFRSF1A STOML2 APP 
TCGAA7A5ZX01  Breast  C11orf53 PCDH17 KIFC2 ZFR2 PON3  PCDH17 KIFC2 ZFR2 MAPK8IP2 ILF2 
TCGAEOA3KW01  Endometrium  WT1AS C20orf96 NLRP10 TP53I13 OR4D6  PCDHGB1 APP PCDHA10 PCDHGA5 HSPB2 
TCGADJA3VE01  Thyroid  MSLN ZMAT3 TNFRSF1A SCNN1A MTA1  RDH5 MSLN ZMAT3 TNFRSF1A DUSP6 
The results shown in (Table 4) are randomly selected tumor samples from each anatomical locations and its ranked gene associations using fusion genetic interaction channel. We only showed the top 5 ranked genes examples for the particular tumor samples for the illustration purpose. This might have limited our results for not showing some of the potential and frequent cancer genes. So, we particularly took out the tumor samples for the breast and rerun our experiments this showed up the Tier 1 known cancer gene for breast such as TP53, EGFR and BRCA1 in the top 100 list. The gene EGFR is shown up 15 times in the top 100 prediction list for breast tumor samples whereas BRCA1 and TP53 are shown up 2 times. The top 100 tumors and gene association showed the primary histology of breast carcinoma from cosmic methylation data. This result is shown in the Additional file 2: Table S2.
The heat diffusion algorithm has advantages for two reasons in our datasets (i) it can give us close link prediction between tumor sample and genes, (ii) it is faster to compute and robust in memory usage, which saves computational cost. [52].
Results
In this section, we evaluate prediction accuracy of heat diffusion algorithm on STRING and BioGRID datasets.
Evaluation metrics
We conducted the crossvalidation by partitioning all tumor samples and gene relationships into ten folds and deleting the hasGene information of the tumor sample in the test set. We computed heat diffusion scores and ranked all tumor by their reconstructed tumor sample  gene relationship and recorded the Area Under the Receiver Operating Characteristics curve (AUCROC). The AUCROC metric can be understood as the probability that randomly chosen missing link is given a higher score than a randomly chosen nonexistent link [79].

The observed links E is randomly split into two parts: the training set E_{train} is treated as known information, while the test set E_{test} is used for testing and no information in the test set is allowed to be used for prediction. Thus total existing edge is then, E=E_{train}∪E_{test} and E_{train}∩E_{test}=ϕ.
 Theoretically, this metric is computed as : AUC=(n^{′}+0.5n^{″})/n. Where,
 1
n’: Number of time the missing links (links in E_{test}) have a higher score than the nonexisting links.
 2
n”: Number of times the scores of missing links is equal to a number of times the score of nonexisting links (links in U−E), where U is the universal set.
 3
n: Number of independent comparison between missing and nonexisting links.
 1
This technique has been widely discussed in the link prediction literature [72,79–81]. If the AUC score exceeds 0.5 which means how better the algorithm performs than by pure chance. The proportion of the positives (links) and the negatives(no links) in our whole set data is 0.0133. This distribution is computed from the percentage of non zero entries from Tumor Samples and Gene Matrix from network layer 1.
The studies by [47,83,84] suggested that gene expression data potentially help in prioritize diseasegene associations. In Fig. 7, we observed mean AUCROC score of 0.79 using coexpression channel for predicting links between tumor sample and genes. The previous studies also suggested that [47,83,84] gene expression data potentially help in prioritizing diseasegene associations.
Consecutively, the other three genetic interaction approaches (i) textmining, (ii) database and (iii) combined channels each have mean AUCROC scores of 0.78. One reason for the combined channel to perform similar as textmining and database channel is that it contains scores of all the channels. As textmining approach might contain noise in the data which would have influenced the similar mean AUCROC score.
For BioGRID physical interaction channel without any weights between the gene pairs, the heat diffusion algorithm showed mean AUCROC scores of 0.74. This shows the potential of network propagation methods by only using network topology to predict the gene associations [23,61].
Location based prediction
This result is a demonstration of how tumor samples are related to gene predictions. We observed the highest AUCROC of 0.86 for predicting lung tumor samples and genes association. The lowest AUCROC of 0.72 for predicting thyroid tumor samples and gene associations. This means our computational networkbased propagation and the data we used are not sufficient to explain the underlying mechanism of tumor gene associations in the thyroid section. Biologically, the study by [85,86] has also described the concept of aggressive clone and tumor heterogeneity in the case of thyroid tissue. So, this makes predicting the association between thyroid tumor samples and gene associations difficult.
Comparison with the baseline algorithms
We compare our results from heat diffusion algorithms with baseline link prediction algorithm. To do this we applied several algorithms for link prediction such as scores based on similarity metrics namely Common Neighbors, Jaccard similarity, Adamic/Adar, Preferential Attachment and Resource Allocations. These algorithms are also called node based topological similarity algorithms because they can be viewed as computing a measure of ‘proximity" or “similarity" between nodes [87].

Random Baseline This simply assign each candidate edge a random score. This score is meant for the benchmark to compare other algorithms.

Node Based The link prediction metrics assigns score of each candidate edges. These metrics presented by [87] are widely used in link prediction problem. However, using two different node sets tumor samples and genes cannot be directly applied in the context of bipartite graph because the neighbors of nodes on opposite sides of the network do not intersect. As the bipartite graph between tumor samples and genes is a directed graph we use the outgoing neighbors of tumor samples to the outgoing neighbors of the incoming neighbors of genes. If we consider tumors samples as x and genes as y then,
The terms used in the Equation below can be described as:
N_{out}(x) denotes outgoing neighbors of node x.

\(N^{\prime }_{out,in}(y)\) can be interpreted as follows: (i) set of all the incoming neighbors of node y. (ii) From the list of neighbor of node y get all the list of outgoing neighbors.
 1
Common Neighbors: \(score (x,y) = N_{out}(x) \cap N^{\prime }_{out,in}(y)\)
 2
Jaccard’s Coefficient: score(x,y)=
\(\frac {N_{out}(x) \cap N^{\prime }_{out,in}(y)}{N_{out}(x) \cup N^{\prime }_{out,in}(y)}\)
 3
Adamic/Adar:
\(score (x,y) = \sum _{z \in N_{out}(x) \cap N^{\prime }_{out,in}(y)} \frac {1}{log N_{out} z}\)
 4
Preferential Attachment:
\(score (x,y) = N_{out}(x) \cdot  N^{\prime }_{out,in}(y)\)
 5
Resource Allocation :
\(score (x,y) = \sum _{z \in N_{out}(x) \cap N^{\prime }_{out,in}(y)} \frac {1}{N_{out} z}\)

 Path Based Path based link prediction is based on the paths from one node to another. The two nodes are likely to be connected if there exist more paths between them. We employed the following metrics to compute the score between two sets of nodes:
 1
Katz: \(score (x,y) = \sum _{i = 1}^{\infty } \beta ^{l} \cdot paths_{x,y}^{< l>}\)
 2
Personalized PageRank: score(x,y) is explained as the probability of node y in a random walk that returns to node x with a probability α at each step, moving to a random neighbor with probability 1−α
 1
Similarly, link prediction algorithm is also evaluated using AUCPR metric. This metric is considered as more informative with heavy class imbalance problem such as link prediction [29,88,89]. This metric can perform robustly in a noisy environment [90]. Hence, in this study, we showed both AUCROC and AUCPR evaluation metric for the link prediction by heat diffusion algorithm for the comparison with baseline.
The result of 10 fold cross validation using heat diffusion algorithm with baseline prediction
Method  AUCROC  AUCPR 

Common neighbor approach (CN)  0.72  0.0382 
Jaccard similarity (JS)  0.76  0.0549 
Preferential attachment (PA)  0.73  0.0428 
Resource allocation (RA)  0.78  0.0612 
Adamic adar index (AAI)  0.79  0.0645 
Katz (STRING) α=0.15,β=0.0001  0.54  0.0161 
Katz (BioGRID) α=0.15,β=0.0001  0.72  0.0311 
Personalized pageRank algorithm (PPR) (STRING)  0.81  0.0523 
Personalized PageRank algorithm (PPR) (BioGRID)  0.81  0.0543 
Heat diffusion algorithm (HD) (STRING)  0.85  0.0823 
Heat diffusion algorithm (HD) (BioGRID)  0.74  0.0321 
Random baseline (RB)  0.5  0.0125 
Random network in layer 2 (RN)  0.5  0.0012 
From Figs. 9 and 10, we see the disagreement between AUCROC and AUCPR score in a link prediction task. AUCPR curves consider the only prediction of the positives and are generally used for problems common in information retrieval, where negatives dominate the positives and are not considered important. For link prediction problem, AUCPR curves give credit for correctly predicting edges but do not give credit for correctly predicting nonedges. This metric is heavily focused on predicting positives. This behavior harshly penalized for nonedges prediction. We believe this is one of the reasons for the discrepancies between the two metrics for the prediction performed by the algorithm. Whereas AUCROC is expected to be balanced metric for evaluating the accuracy of link prediction considering both edges and nonedges of the nodes. The prior study by [91] has also covered the anomalies of these discrepancies in the context of link prediction.
The figure indicates the pvalues of the ttest at significant level α=0.05, *** indicates highly significant
CN  JS  PA  RA  AAI  Katz (STRING)  Katz (BioGRID)  PPR(STRING)  PPR(BioGRID)  RN (STRING)  RN (BioGRID)  

HD (STRING)  
AUCROC  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***    < 2.2e16 ***    1.524e14 ***   
HD (STRING)  
AUCPR  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***    < 2.2e16 ***    < 2.2e16 ***   
HD (BioGRID)  
AUCROC  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  8.611e14 ***  < 2.2e16 ***    2.667e14 ***    < 2.2e16 ***    8.839e13*** 
HD (BioGRID)  
AUCPR  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***  < 2.2e16 ***    0.00042 **    < 2.2e16 ***    < 2.2e16 *** 
One of the important aspects of the heat diffusion is that it represents an exponential sum which converges more quickly in most cases than the geometric sum of Personalized PageRank [92]. This can be advantageous in the large graphs to get the desired results faster. In the biological context, similar work in prioritizing disease and genes [47] had already shown that the heat diffusion based ranking outperforms other diffusion methods in ranking diseasecausing genes.
Comparison with the hotNet2 and hotNet heat diffusion algorithms
The result of 10 Fold cross validation using heat diffusion algorithm with HotNet and HotNet2 algorithms
Methods  Mean AUCROC score  Standard deviation 

HotNet2  0.73  0.001889 
HotNet  0.82  0.001835 
Our heat diffusion method  0.84  0.001432 
The major difference between HotNet2 and HotNet is how heat diffusion is modeled. The HotNet2 algorithm is modeled for a directed network whereas HotNet is modeled for the undirected network. As we implemented our heat diffusion model for the undirected network the result between HotNet and our approach is also similar. Though our approach has marginal improvement of the prediction accuracy. So, we performed paired ttest to check whether the difference is statistically significant between the 10 fold cross validation result by both methods. The pvalue of the test (pvalue = 4.489e09) suggests there is the same problem in prediction performed by both the methods. The major technical difference in the HotNet and our heat diffusion model is how we propagate the heat. We used the discrete approximation of heat diffusion model \(\left (I+\frac {\alpha }{M}R\right)^{M}\) [66] which has linear complexity whereas HotNet uses continuous diffusion kernel e^{αR} which has cubic complexity and for a huge graph this might be a problem. The linear kernel used in our approach is regarded as the random walk through the network which is comparable to the exponential or continuous diffusion kernel.
Independent validation
In order to evaluate our heat diffusion model with independent datasets we chose STRING data because the heat diffusion model performed better in it in the crossvalidation test. So, in the independent validation, we split our STRING data into 3 parts training (60%), validation(20%) and test sets(20%). The parameters for the diffusion are chosen from the training set. The model performed the AUCROC of 0.83 and an AUCPR score of 0.078 in the test set.
Statistical significance of the tumorgene link prediction
We observed from the histogram that the large proportion of links are statistically significant (pvalue are near to zero). However, for some links, the pvalues are large, thus we have a risk of reporting false positives for a small proportion of tumor samples. This problem may be caused by the quality of the incomplete genetic interaction network from the STRING database.
As we have performed the test 1000 times, it is crucial to do the multiple hypothesis correction. The most common method to check this is to apply a Bonferroni correction test [95]. To do this we have to set up the new critical value which is \(\alpha = \frac {\alpha *}{m}\). In our setting, α∗=0.05 and m = 1000, hence the new α of the individual test would be 0.05/1000 = 5e05. Thus we consider those tumorgene links with pvalues <5e05 to be significant. From the initial permutation test, we observed that there are 130913 tumor samples and gene links to be significant. Applying the Bonferroni correction criterion, with the pvalues <5e05 we found that only 62414 tumor samples and gene link to be significant.
Biological meaning of the tumor sample and gene prediction
We examined the top 100 predictions performed by the heat diffusion algorithms. The results are ranked by heat diffusion scores. For each tumor sample and gene association, we looked at the biological meaning of the predicted links using the COSMIC database. This database collects somatic mutations from “The Cancer Genome Atlas (TCGA)" as well as from many smallerscale studies and experimental studies [96]. Our results showed that the heat diffusion algorithm predicted 5 putative cancer gene CDH10, CHST11, GRM3, VAV1 and CCR4 from Tier 2 of the Cancer Gene Census^{6}. Similarly 9 Tier 1 known cancer genes such as TBX3, CNBP, CUX1, KLF6, HOXC13, FUS, BIRC3, GNAS and TNFAIP3. These genes have documented evidence of their relevance to cancer. Not only the cancer genes but also heat diffusion identified 16 genes in which mutations are associated with altered drug sensitivity in cancer. The rest of the predicted tumor sample and genes have the evidence that the mouse insertional mutagenesis experiments support them as a cancer driver gene [97,98].
The results of the 100 predicted tumor samples and gene association with diffusion scores is provided in xlsx sheet as a Additional file 1: Table S1.
Out of the top 100 genes predicted, we found 9 tier 1 and 5 tier 2 cancer genes and 11 genes in which mutations are associated with altered drug sensitivity. The drug sensitivity information is identified by manually inspecting the COSMIC database. However, in a total of 4071 genes, we have a total of 244 (tier 1 and tier 2) cancer genes in our genegene interaction network. So we further investigated the statistical significance of the (14%) proportion of the top 100 predicted genes by randomizing genegene interaction graph using the Equation 8. The pvalue of the test is 0.015 which suggest the 14 genes related to cancers out of top 100 predictions is statistically significant.
We further performed the test randomizing the network layer 1 which is the bipartite graph between tumor samples and genes by preserving the degree distribution. We found the nominal pvalue to observe (14%) proportion of the top 100 predicted genes statistically significant (pvalue <1e3).
Discussion
We considered two different baselines to compare the results of the algorithm. One is nodebased and another is pathbased algorithms. Out of the Nodebased link prediction metrics, Adamic/Adar and Resource Allocation methods performed the best and Common Neighbor approach performed the worst in the datasets. In terms of AUCROC and AUCPR, both Adamic/Adar and Resource Allocation have similar scores. Whereas, the heat diffusion algorithm has produced more accurate predictions, surpassing Adamic/Adar and Resource Allocation by up to 7.05% in STRING data with regard to AUCROC. While in the case of BioGRID data heat diffusion algorithm did not perform better in comparison with the Jaccard Similarity, Resource Allocation, and Adamic/Adar methods. Heat Diffusion algorithm performed better than Katz scores but worst in comparison with Personalized PageRank Algorithm for BioGRID datasets. One reason that heat diffusion performed better in STRING network in comparison to BioGRID is the coverage of the network. The network created from STRING is based on the assumption that it has integrated data from different sources which might affect the prediction results. Personalized PageRank algorithm surpassed all the link prediction methods for BioGRID datasets. In terms of AUCROC, Personalized PageRank algorithm gained 8% and in terms of AUCPR 40% relative improvement over heat diffusion and Katz method for BioGRID datasets. The integration of gene interaction data in the diffusion model has proved to have a significant influence on the performance of the tumor samples and gene link prediction. One thing is also important to observe that only exploiting the link structures Personalized PageRank algorithm, outperforms several link prediction algorithms in both STRING and BioGRID datasets as shown in Table 5.
We observed heat diffusion algorithm outperform Personalized PageRank in STRING data. There is a gain of 4% in AUCROC scores using heat diffusion algorithms over Personalized PageRank algorithm which means the genetic interaction scores in heat diffusion are contributing to the improvement of prediction quality.
The heat diffusion algorithm also gives the nonexistent links which are not in the training set. In this work we did not further investigate about the nonexistent links because those evidences were not reported in the TCGA COSMIC database and we are unsure whether those are spurious or biological meaningful links. Though it is very relevant research direction to further investigate to find out the biological relevance of the nonexistent links predicted by the algorithm.
Conclusion
We presented the heat diffusion algorithm, to predict links between tumor samples and gene in a 2layer network. We used the heat diffusion algorithm to diffuse in 4086 independent tumor samples to 4071 genes. The heat is then diffused in nine different gene interaction channels. We noted that link prediction between tumor sample and genes gave us the highest AUCROC scores in fusion, cooccurrence and neighborhood channels in STRING data. The heat diffusionbased method gives us decent prediction even if no knowledge is available about the disease or phenotype and outperformed some of the baseline prediction such as Common Neighbors, Preferential Attachment and Katz methods. The other reason to choose heat diffusion is less memory intensive and faster to compute. In our experiment, we observed that the Personalized PageRank Algorithm also gave comparable results with heat diffusion methods. One of the advantages of using heat diffusion over Personalized PageRank method is that heat diffusion uses exponential sum, which converges quickly over personalized PageRank which uses geometric sum.
Future work
There are several directions for future work. One of the important aspects would be the biological validation of the results, although we showed in our experiment that some of the top predictions per anatomical locations are indeed cancer genes. Computationally, we can evaluate the likelihood of identifying cancer genes if we run our algorithm in random data.
It is important to compare our results with other forms of somatic mutation data like copy number variation. In this work, we have not evaluated different cancer data for heat diffusion model. We only evaluated heat diffusion algorithms performance with a different state of the art link prediction algorithm for DNA methylation data. It would be important to see how heat diffusion algorithms perform in other somatic mutational data and compare against DNA methylation which could be a future work.
Footnotes
Notes
Acknowledgements
We would like to thank Unit for Information Mining and Retrieval (UIMR) at Insight Center for Data Analytics for providing us the computational infrastructure.
Authors’ contributions
MT implemented the heat diffusion method. HY and DRS provided valuable advice on the method. MT collected the data and analyzed the prediction results. MT wrote the manuscript. DRS and RS provided guidance and revised the manuscript. All authors read and approved the final manuscript.
Funding
We would like to acknowledge Science Foundation Ireland (SFI/12/RC/2289) for funding this research.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary material
References
 1.Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003; 33(3s):228.PubMedCrossRefGoogle Scholar
 2.Zeng X, Liao Y, Liu Y, Zou Q. Prediction and validation of disease genes using hetesim scores. IEEE/ACM Trans Comput Biol Bioinform. 2017; 14(3):687–95.PubMedCrossRefGoogle Scholar
 3.Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010; 6(1):1000641.CrossRefGoogle Scholar
 4.Vidal E, Sayols S, Moran S, GuillaumetAdkins A, Schroeder M, Royo R, Orozco M, Gut M, Gut I, LopezBigas N, et al.A dna methylation map of human cancer at single basepair resolution. Oncogene. 2017; 36(40):5648.PubMedPubMedCentralCrossRefGoogle Scholar
 5.Laird PW. The power and the promise of dna methylation markers. Nat Rev Cancer. 2003; 3(4):253.PubMedCrossRefGoogle Scholar
 6.Mikeska T, Craig JM. Dna methylation biomarkers: cancer and beyond. Genes. 2014; 5(3):821–64.PubMedPubMedCentralCrossRefGoogle Scholar
 7.Kurkjian C, Kummar S, Murgo AJ. Dna methylation: its role in cancer development and therapy. Curr Probl Cancer. 2008; 32(5):187.PubMedPubMedCentralCrossRefGoogle Scholar
 8.Zheng H, Wu H, Li J, Jiang S. W.Cpgimethpred: computational model for predicting methylation status of cpg islands in human genome. BMC Med Genomics. 2013; 6(1):13.CrossRefGoogle Scholar
 9.Bhasin M, Zhang H, Reinherz EL, Reche PA. Prediction of methylated cpgs in dna sequences using a support vector machine. FEBS Lett. 2005; 579(20):4302–8.PubMedCrossRefGoogle Scholar
 10.DeVries S, Nyante S, Korkola J, Segraves R, Nakao K, Moore D, Bae H, Wilhelm M, Hwang S, Waldman F. Arraybased comparative genomic hybridization from formalinfixed, paraffinembedded breast tumors. J Mol Diagn. 2005; 7(1):65–71.PubMedPubMedCentralCrossRefGoogle Scholar
 11.Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nat Med. 2004; 10(8):789. http://dl.acm.org/citation.cfm?id=3104482.3104584.PubMedCrossRefGoogle Scholar
 12.Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Finding the cancercritical genes. 2002.Google Scholar
 13.Lage K, Karlberg EO, Størling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N, et al.A human phenomeinteractome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007; 25(3):309–16.PubMedCrossRefGoogle Scholar
 14.Zou Q, Li J, Song L, Zeng X, Wang G. Similarity computation strategies in the micrornadisease network: a survey. Brief Funct Genomics. 2015; 15(1):55–64.PubMedGoogle Scholar
 15.Zou Q, Li J, Wang C, Zeng X. Approaches for recognizing disease genes based on network. BioMed Res Int. 2014; 2014:1–10.Google Scholar
 16.Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011; 12(1):56–68.PubMedPubMedCentralCrossRefGoogle Scholar
 17.Chen Y, Xu R. Phenomebased gene discovery provides information about parkinson’s disease drug targets. BMC Genomics. 2016; 17(5):493.PubMedPubMedCentralCrossRefGoogle Scholar
 18.Sun Y, Han J, Aggarwal CC, Chawla NV. When will it happen?: relationship prediction in heterogeneous information networks. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. ACM: 2012. p. 663–72.Google Scholar
 19.Martínez V, Berzal F, Cubero J. C.A survey of link prediction in complex networks. ACM Comput Surv (CSUR). 2016; 49(4):69.CrossRefGoogle Scholar
 20.Mordelet F, Vert JP. Prodige: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics. 2011; 12(1):389.PubMedPubMedCentralCrossRefGoogle Scholar
 21.Davis D, Lichtenwalter R, Chawla NV. Multirelational link prediction in heterogeneous information networks. In: Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference On. IEEE: 2011. p. 281–8.Google Scholar
 22.SinghBlom UM, Natarajan N, Tewari A, Woods JO, Dhillon IS, Marcotte EM. Prediction and validation of genedisease associations using methods inspired by social network analysis. PLoS ONE. 2013; 8(5):58977.CrossRefGoogle Scholar
 23.Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet. 2017; 8:551–62.CrossRefGoogle Scholar
 24.Shi C, Li Y, Zhang J, Sun Y, Philip SY. A survey of heterogeneous information network analysis. IEEE Trans Knowl Data Eng. 2017; 29(1):17–37.CrossRefGoogle Scholar
 25.Gruhl D, Guha R, LibenNowell D, Tomkins A. Information diffusion through blogspace. In: Proceedings of the 13th International Conference on World Wide Web. ACM: 2004. p. 491–501. https://doi.org/10.1145/988672.988739.
 26.Menon AK, Elkan C. Link prediction via matrix factorization. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer: 2011. p. 437–52. https://doi.org/10.1007/9783642237836_28.CrossRefGoogle Scholar
 27.Kunegis J, Lommatzsch A. Learning spectral graph transformations for link prediction. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM: 2009. p. 561–8. https://doi.org/10.1145/1553374.1553447.
 28.Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. Line: Largescale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web: 2015. p. 1067–77. https://doi.org/10.1145/2736277.2741093.
 29.Nickel M, Tresp V, Kriegel HP. A threeway model for collective learning on multirelational data. In: ICML, vol. 11. Omnipress: 2011. p. 809–16. http://dl.acm.org/citation.cfm?id=3104482.3104584.
 30.Zitnik M, Leskovec J. Predicting multicellular function through multilayer tissue networks. Bioinformatics. 2017; 33(14):190–8.CrossRefGoogle Scholar
 31.Krompaß D, Nickel M, Tresp V. Largescale factorization of typeconstrained multirelational data. In: Data Science and Advanced Analytics (DSAA), 2014 International Conference On. IEEE: 2014. p. 18–24. https://doi.org/10.1109/dsaa.2014.7058046.
 32.Lever J, Gakkhar S, Gottlieb M, Rashnavadi T, Lin S, Siu C, Smith M, Jones M, Krzywinski M, Jones SJ. A collaborative filtering based approach to biomedical knowledge discovery. Bioinformatics. 2017.Google Scholar
 33.Wang D, Cui P, Zhu W. Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2016. p. 1225–34. https://doi.org/10.1145/2939672.2939753.
 34.Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2016. p. 855–64. https://doi.org/10.1145/2939672.2939754.
 35.Perozzi B, AlRfou R, Skiena S. Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2014. p. 701–10. https://doi.org/10.1145/2623330.2623732.
 36.Zhou W, Jia Y. Predicting links based on knowledge dissemination in complex network. Phys A: Stat Mech Appl. 2017; 471:561–8.CrossRefGoogle Scholar
 37.Zhang Y. C., Medo M, Ren J, Zhou T, Li T, Yang F. Recommendation model based on opinion diffusion. EPL (Europhys Lett). 2007; 80(6):68003.CrossRefGoogle Scholar
 38.Zhang YC, Blattner M, Yu YK. Heat conduction process on community networks as a recommendation model. Phys Rev Lett. 2007; 99(15):154301.PubMedCrossRefGoogle Scholar
 39.Carlin DE, Demchak B, Pratt D, Sage E, Ideker T. Network propagation in the cytoscape cyberinfrastructure. PLoS Comput Biol. 2017; 13(10):1005598.CrossRefGoogle Scholar
 40.Liu Z, Zhang S. Tumor characterization and stratification by integrated molecular profiles reveals essential pancancer features. BMC Genomics. 2015; 16(1):503.PubMedPubMedCentralCrossRefGoogle Scholar
 41.Zhong X, Yang H, Zhao S, Shyr Y, Li B. Networkbased stratification analysis of 13 major cancer types using mutations in panels of cancer genes. BMC Genomics. 2015; 16(7):7.CrossRefGoogle Scholar
 42.Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási AL. Uncovering diseasedisease relationships through the incomplete interactome. Science. 2015; 347(6224):1257601.PubMedPubMedCentralCrossRefGoogle Scholar
 43.Ruffalo M, Koyutürk M, Sharan R. Networkbased integration of disparate omic data to identify “silent players” in cancer. PLoS Comput Biol. 2015; 11(12):1004595.CrossRefGoogle Scholar
 44.Lovász L. Random walks on graphs. Comb Paul erdos Eighty. 1993; 2:1–46.Google Scholar
 45.Page L, Brin S, Motwani R, Winograd T. The pagerank citation ranking: Bringing order to the web. 1999;:1999–66. http://ilpubs.stanford.edu:8090/422/.
 46.BenAvraham D, Havlin S. Diffusion and Reactions in Fractals and Disordered Systems. 2000. https://doi.org/10.1017/cbo9780511605826.
 47.Nitsch D, Gonçalves JP, Ojeda F, De Moor B, Moreau Y. Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC Bioinformatics. 2010; 11(1):460.PubMedPubMedCentralCrossRefGoogle Scholar
 48.Gonçalves JP, Francisco AP, Moreau Y, Madeira SC. Interactogeneous: disease gene prioritization using heterogeneous networks and full topology scores. PLoS ONE. 2012; 7(11):49634.CrossRefGoogle Scholar
 49.Vandin F, Clay P, Upfal E, Raphael BJ. Discovery of mutated subnetworks associated with clinical data in cancer. In: Biocomputing 2012: 2012. p. 55–66. https://doi.org/10.1142/9789814366496_0006.
 50.Bourigault S, Lagnier C, Lamprier S, Denoyer L, Gallinari P. Learning social network embeddings for predicting information diffusion. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM: 2014. p. 393–402. https://doi.org/10.1145/2556195.2556216.
 51.Thanou D, Dong X, Kressner D, Frossard P. Learning heat diffusion graphs. IEEE Trans Signal Inf Process Over Netw. 2017; 3(3):484–99.CrossRefGoogle Scholar
 52.AlMohy AH, Higham NJ. Computing the action of the matrix exponential, with an application to exponential integrators. SIAM J Sci Comput. 2011; 33(2):488–511.CrossRefGoogle Scholar
 53.Hou JP, Ma J. Dawnrank: discovering personalized driver genes in cancer. Genome Med. 2014; 6(7):56.PubMedPubMedCentralCrossRefGoogle Scholar
 54.Timilsina M, Yang H, RebholzSchuhmann D. A 2layered graph based diffusion approach for altmetric analysis. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE: 2018. p. 463–6. https://doi.org/10.1109/asonam.2018.8508290.
 55.Tsuda K, Noble WS. Learning kernels from biological networks by maximizing entropy. Bioinformatics. 2004; 20(suppl_1):326–33.CrossRefGoogle Scholar
 56.Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004; 20(16):2626–35.PubMedCrossRefGoogle Scholar
 57.Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S, et al.Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2014; 43(D1):805–11.PubMedPubMedCentralCrossRefGoogle Scholar
 58.Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008; 82(4):949–58.PubMedPubMedCentralCrossRefGoogle Scholar
 59.Pavlopoulos GA, Secrier M, Moschopoulos CN, Soldatos TG, Kossida S, Aerts J, Schneider R, Bagos PG. Using graph theory to analyze biological networks. BioData Min. 2011; 4(1):10.PubMedPubMedCentralCrossRefGoogle Scholar
 60.ChatrAryamontri A, Breitkreutz B. J., Oughtred R, Boucher L, Heinicke S, Chen D, Stark C, Breitkreutz A, Kolas N, O’donnell L, et al.The biogrid interaction database: 2015 update. Nucleic Acids Res. 2014; 43(D1):470–8.PubMedPubMedCentralCrossRefGoogle Scholar
 61.Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010; 26(8).PubMedPubMedCentralCrossRefGoogle Scholar
 62.Chen J, Aronow BJ, Jegga AG. Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics. 2009; 10(1):73.PubMedPubMedCentralCrossRefGoogle Scholar
 63.Aragues R, Sander C, Oliva B. Predicting cancer involvement of genes from heterogeneous data. BMC Bioinformatics. 2008; 9(1):172.PubMedPubMedCentralCrossRefGoogle Scholar
 64.Li Y, Patra JC. Genomewide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010; 26(9):1219–24.PubMedCrossRefGoogle Scholar
 65.Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, Billis K, CarvalhoSilva D, Cummins C, Clapham P, et al.Ensembl 2017. Nucleic Acids Res. 2016; 45(D1):635–42.PubMedPubMedCentralCrossRefGoogle Scholar
 66.Yang H, King I, Lyu MR. Diffusionrank: a possible penicillin for web spamming. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM: 2007. p. 431–8. https://doi.org/10.1145/1277741.1277815.
 67.Ma H, Yang H, King I, Lyu MR. Learning latent semantic relations from clickthrough data for query suggestion. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM: 2008. p. 709–18. https://doi.org/10.1145/1458082.1458177.
 68.Ma H, Yang H, Lyu MR, King I. Mining social networks using heat diffusion processes for marketing candidates selection. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM: 2008. p. 233–42. https://doi.org/10.1145/1458082.1458115.
 69.Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, et al.Chromosome conformation capture carbon copy (5c): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 2006; 16(10):1299–309.PubMedPubMedCentralCrossRefGoogle Scholar
 70.Chepelev I, Wei G, Wangsa D, Tang Q, Zhao K. Characterization of genomewide enhancerpromoter interactions reveals coexpression of interacting genes and modes of higher order chromatin organization. Cell Res. 2012; 22(3):490–503.PubMedPubMedCentralCrossRefGoogle Scholar
 71.Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J. Constructing gene coexpression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics. 2007; 8(1):299.PubMedPubMedCentralCrossRefGoogle Scholar
 72.Clauset A, Moore C, Newman ME. Hierarchical structure and the prediction of missing links in networks. Nature. 2008; 453(7191):98.PubMedCrossRefGoogle Scholar
 73.Mitelman F, Johansson B, Mertens F. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer. 2007; 7(4):233–45.PubMedCrossRefGoogle Scholar
 74.Zhao J, Li X, Yao Q, Li M, Zhang J, Ai B, Liu W, Wang Q, Feng C, Liu Y, et al.Rwcfusion: identifying phenotypespecific cancer driver gene fusions based on fusion pair random walk scoring method. Oncotarget. 2016; 7(38):61054.Google Scholar
 75.Colotti G, Poser E, Fiorillo A, Genovese I, Chiarini V, Ilari A. Sorcin, a calcium binding protein involved in the multidrug resistance mechanisms in cancer cells. Molecules. 2014; 19(9):13976–89.PubMedPubMedCentralCrossRefGoogle Scholar
 76.Xiao F, Bai Y, Chen Z, Li Y, Luo L, Huang J, Yang J, Liao H, Guo L. Downregulation of hoxa1 gene affects small cell lung cancer cell survival and chemoresistance under the regulation of mir100. Eur J Cancer. 2014; 50(8):1541–54.PubMedCrossRefGoogle Scholar
 77.Rauch TA, Wang Z, Wu X, Kernstine KH, Riggs AD, Pfeifer GP. Dna methylation biomarkers for lung cancer. Tumor Biol. 2012; 33(2):287–96.CrossRefGoogle Scholar
 78.Eklund EA. The role of hox genes in malignant myeloid disease. Curr Opin Hematol. 2007; 14(2):85–89.PubMedCrossRefGoogle Scholar
 79.Wang L, Hu K, Tang Y. Robustness of linkprediction algorithm based on similarity and application to biological networks. Curr Bioinform. 2014; 9(3):246–52.CrossRefGoogle Scholar
 80.Liu W, Lü L. Link prediction based on local random walk. EPL (Europhys Lett). 2010; 89(5):58007.CrossRefGoogle Scholar
 81.Lü L, Zhou T. Link prediction in complex networks: A survey. Phys A: Stat Mech Appl. 2011; 390(6):1150–70.CrossRefGoogle Scholar
 82.Von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P. String: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005; 33(suppl_1):433–7.Google Scholar
 83.Nica AC, Dermitzakis ET. Using gene expression to investigate the genetic basis of complex disorders. Hum Mol Genet. 2008; 17(R2):129–34.CrossRefGoogle Scholar
 84.Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. Mapping complex disease traits with global gene expression. Nat Rev Genet. 2009; 10(3):184–94.PubMedPubMedCentralCrossRefGoogle Scholar
 85.Kuhn E, Teller L, Piana S, Rosai J, Merino MJ. Different clonal origin of bilateral papillary thyroid carcinoma, with a review of the literature. Endocr Pathol. 2012; 23(2):101–7.PubMedCrossRefGoogle Scholar
 86.Schopper HK, Stence A, Ma D, Pagedar NA, Robinson RA. Single thyroid tumour showing multiple differentiated morphological patterns and intramorphological molecular genetic heterogeneity. J Clin Pathol. 2016; 70:2016.Google Scholar
 87.LibenNowell D, Kleinberg J. The linkprediction problem for social networks. J Assoc Inf Sci Technol. 2007; 58(7):1019–31.CrossRefGoogle Scholar
 88.Lichtnwalter R, Chawla NV. Link prediction: fair and effective evaluation. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012): 2012. p. 376–83. IEEE Computer Society. https://doi.org/10.1109/asonam.2012.68.
 89.GarciaGasulla D, Ayguadé E, Labarta J, Cortés U. Limitations and alternatives for the evaluation of largescale link prediction. 2016. arXiv preprint arXiv:1611.00547.Google Scholar
 90.Zhang P, Wang X, Wang F, Zeng A, Xiao J. Measuring the robustness of link prediction algorithms under noisy environment. Sci Rep. 2016; 6. https://doi.org/10.1038/srep18881.
 91.Junuthula RR, Xu KS, Devabhaktuni VK. Evaluating link prediction accuracy in dynamic networks with added and removed edges. In: 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)(BDCloudSocialComSustainCom). IEEE: 2016. p. 377–84. https://doi.org/10.1109/bdcloudsocialcomsustaincom.2016.63.
 92.Chung F. The heat kernel as the pagerank of a graph. Proc Natl Acad Sci. 2007; 104(50):19735–40.CrossRefGoogle Scholar
 93.Leiserson MD, Vandin F, Wu H. T., Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M, et al.Pancancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet. 2015; 47(2):106.PubMedPubMedCentralCrossRefGoogle Scholar
 94.Vandin F, Upfal E, Raphael BJ. Algorithms for detecting significantly mutated pathways in cancer. J Comput Biol. 2011; 18(3):507–22.PubMedCrossRefGoogle Scholar
 95.Simes RJ. An improved bonferroni procedure for multiple tests of significance. Biometrika. 1986; 73(3):751–4.CrossRefGoogle Scholar
 96.Amar D, Izraeli S, Shamir R. Utilizing somatic mutation data from numerous studies for cancer research: proof of concept and applications. Oncogene. 2017; 36(24):3375.PubMedPubMedCentralCrossRefGoogle Scholar
 97.Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, et al.The genomic landscapes of human breast and colorectal cancers. Science. 2007; 318:1108–13.PubMedPubMedCentralCrossRefGoogle Scholar
 98.Parsons DW, Jones S, Zhang X, Lin JCH, Leary RJ, Angenendt P, Mankoo P, Carter H, Siu IM, Gallia GL, et al.An integrated genomic analysis of human glioblastoma multiforme. Science. 2008; 321:1807–1812.PubMedPubMedCentralCrossRefGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.