Background

One crucial step toward understanding the molecular basis of diseases, such as cancer, diabetes, and cardiovascular disorders, is to identify the predisposing or virulence genes of these diseases, which will lead to early disease diagnosis and effective drug design [1]. With the availability of the big biomedical data, researchers tend to get insights into human diseases by identifying genes that might cause or relate to them. Given the fact that experimentally identifying of the complete list of disease-related genes is generally impractical due to the high cost, computational methods have been proposed in the last decades to predict the relationships between genes and human diseases [210]. However, these tools, including filtering methods based on a set of criteria [11], text mining of biomedical literature [12], integration of genomic data [1315], semantic similarity [1621] based disease gene prioritization [22] and network analysis based and highly robust approach [8, 2326], remain pre-eminent [27].

A human cell consists of several functionally inter-dependent molecular components. A human disease rarely results from an abnormality in a single gene but reflects the perturbations of the complex molecular network induced by different kinds of factors, such as genetic variations, pathogens and epigenetic changes [28]. The molecular network links molecular states to physiological states associated with diseases in a whole system view [29]. Therefore, network-based approaches may offer better targets for drug development, and may lead to multiple potential biological and clinical applications including disease gene discovery [28].

The network-based approaches for disease gene identification can be loosely grouped into three categories. The simplest approach, named direct neighbor counting, is to check whether two genes are connected directly in a molecular network. The idea is that if a gene is connected to one of the known disease genes, it may be associated with the same disease [30]. The experimental result demonstrates that using molecular networks can effectively increase the likelihood of identifying candidate disease genes. The direct neighbor counting method, however, does not consider the situation that two genes are not connected directly but still have certain biological associations. To address this problem, Kruthammmer et al. employed the shortest path length approach to measure the closeness between a disease gene and a candidate gene. This method has been successfully applied to predict the genes associated Alzheimer’s disease, and the prediction results agree with the manually curated candidates [31]. Since both the direct neighbor counting method and the shortest path method are local distance measurements, they largely ignore the global structure of the whole molecular network and cannot fully capture the complex relationships between network nodes [32]. Subsequently, methods have been proposed to predict the gene-disease relation using the global network structure, such as random walk with restart (RWR) [33], propagation flow [34], Markov clustering [35] and graph partitioning [36]. The performance evaluation on HPRD [37], OPHID [38] and OMIM [39] dataset shows that RWR is the best among the then-existing methods [5].

Rapidly evolving bio-technologies promote collecting multiple types of biological data, including diverse genome-scale data, clinical phenotype data, environment data, and data of daily activities [40], making it feasible and attractive to build integrated biomedical networks from multiple sources, rather than focus on one single data set. The integrated network that includes multiple, heterogeneous types of resources, greatly extends the scope and ability for disease gene prediction [41]. For example, BioGraph [42] uses data from 21 publicly available curated databases to identify relations between heterogeneous biomedical entities. The work by Ganegoda et al. runs RWR on a integrated network, and has successfully identified disease-related genes with significant improved performance [43].

Using integrated networks for gene-disease relationship discovery is still a difficult task due to the existence of multiple biomedical entities in the integrated networks. In a network built using a single type of biomedical data, there is only one type of nodes and one type of edges. For example, in a protein-protein interaction network, nodes and edges represent proteins and protein interactions respectively. The integrated network, on the contrary, contains multiple types of nodes and edges representing different biomedical entities (such as genes, diseases, and ontology terms) and relationships (such as DNA-protein binding and gene ontology annotation). In order to differentiate these edge types, edge weights in the integrated biomedical network should be appropriately assigned [44].

In this article, we present a new algorithm called SLN-SRW (Simplified Laplacian Normalization-Supervised Random Walk) to define edge weights in an integrated network and use the weighted network to predict gene-disease relationships. Comparing with the existing approaches, SLN-SRW has the following advantages:

  • SLN-SRW is the first approach, to the best of our knowledge, to predict gene-disease relationships based on a weighted integrated network with its edge weight being computed to precisely describe the importance of different edge types.

  • The performance of random walk may be strongly affected by the super hub nodes in an integrated network. SLN-SRW adopts a Laplacian normalization based method to avoid such bias.

  • To prepare inputs for SLN-SRW, we constructed a new heterogeneous integrated network based on three widely used biomedical ontologies, i.e. Human Phenotype Ontology [45], Disease Ontology [46], and Gene Ontology [47, 48], and biological databases such as STRING [49]. This integrated network combines biomedical knowledge from ontologies that are manually curated and big biomedical data that have been deposited in databases. Based on these two distinctively different types of information, this network forms a foundation for disease gene discovery.

Methods

We propose SLN-SRW to compute and model the edge weight of an integrated network and then predict disease genes. To achieve the goal, SLN-SRW consists of three steps. First, it integrates knowledge and data from multiple ontologies and databases to construct an integrated network G(V,E), where V is a set of nodes and E is a set of edges that connect the nodes in V. Second, it uses a Laplacian normalization based supervised random walk algorithm to learn the edge weight of network G, resulting in a weighted integrated network G w . Third, it employs the RWR method on G w to predict disease-gene relationships. The diagram of the whole process of SLN-SRW is shown in Fig. 1. We will introduce the key steps of SLN-SRW in the rest of this section.

Fig. 1
figure 1

The Framework of SLNSRW. Framework of SLN-SRW for estimating the edge weight of the integrated network automatically and predicting disease genes based on it. The second step is the essential part of SLN-SRW algorithm

Step 1. Integrating heterogeneous knowledge and data sources for integrated network construction

In the first step of SLN-SRW, an integrated network is constructed based on eleven heterogeneous data sources in four distinct forms, i.e. ontologies, networks, unified vocabularies, and relational databases. The data sources are listed in Table 1, and they are mainly used for relation extraction, name mapping, and unified vocabulary. The data sources can be grouped into two categories: 1) Curated data that were collected from literature and other high quality data sources, such as Search Tool for the Retrieval of Interacting Gene/Proteins (STRING) and Online Mendelian Inheritance in Man (OMIM), and 2) Curated ontologies that constructed manually by the domain expert, such as Gene Ontology (GO) and Disease Ontology (DO).

Table 1 Integrated databases and ontologies. The first column, second column, and third column represent the abbreviation of the data source, simplify the description of the data source and the relationship extracted from the data source respectively. Eleven data sources are used to construct the integrated network. Specific types of nodes and edges are extracted from various data sources and integrated into a network

The workflow for constructing the integrated network out of the heterogeneous data sources is shown in Fig. 2. Specifically, the network construction process has the following four steps:

  • Extracting information from heterogeneous data sources. Ontology parser and database parser have been developed for ontology and database data extraction respectively. The ontology parser processes the OBO file and the ontology annotation file, since HPO, DO and GO are all in Open Biomedical Ontologies (OBO) format. The database parser processes files in Tab Separated Values (TSV), Comma Separated Values (CSV), and Extensible Markup Language (XML) format. The outputs of the two parsers are pair-wise relations and their properties between two biomedical entities.

    Fig. 2
    figure 2

    The workflow of constructing the integrated network. Work flow of constructing the integrated network based on multiple data sources

  • Unifying biomedical entity IDs. The same pair-wise relation may be extracted from multiple data sources with different identifiers. To avoid confusion, we provide a distinct ID number for each biomedical entity by mapping all types of identifiers to the ones provided in the Unified Medical Language System (UMLS). The challenge is that some types of identifiers cannot be direct mapped to UMLS. For example, only a small part (61%) of HPO and DO term can be mapped to UMLS. Therefore, we adopted ClinVar [50] to map all the HPO terms to UMLS, and utilized SIDD [51] to map all the disease names in DO to MeSH ID, provided that there are direct mappings between MeSH ID and UMLS. Please see Additional file 1 for more details. After unifying the entity IDs from multiple data sources, each entity only has one identifier in the database. We removed the identifiers that cannot be mapped to UMLS.

  • Constructing the integrated network. The binary relations extracted from multiple data sources form an integrated network G, in which nodes are biomedical entities (i.e. ontology terms and genes), and edges are the relationships between the entities, which have seven different types: GO term - GO term, GO term - gene, DO term - DO term, DO term - gene, HPO term - HPO term, HPO term - gene, and gene - gene.

  • Edge initial weight assignment. We assign the initial edge weight t(u,v) to every edge <u,v> according to its edge type and the evidence code associated to the edge, where both u and v are nodes in G. Specifically, for the edge types that have edge confidence scores in the source databases, we use the confidence scores directly. For the edge types that do not have confidence scores but are associated with evidence codes, we manually assign initial edge weights based on their edge evidence codes (see Additional file 2 for the manually assigned weights). The edge initial weights are between 0 and 1, and the experimentally verified edges have higher initial weights than computational predictions. For example, an edge between a GO term and a gene with evidence code “EXP” has a high weight (1.0), whereas an edge with “IEA” code has a low weight (0.4), since “EXP” indicates the GO-gene relationship has been experimentally verified while “IEA” means computational prediction. Note that for the edges that have two or more evidences in E, the initial weights are calculated as the maximal weight of all the valid evidence codes.

Step 2. Weighing the importance of different types of edges in integrated network

Given an integrated network G with manually assigned initial edge weights, the aim of this step is to automatically re-assign all the edge weights, such that the weighted network G w can be used for more precise disease gene prediction. To achieve this goal, we develop a new edge weight optimization algorithm based on supervised random walks (SRW) [52]. SRW combines the information from network structure with the node and edge level attributes, which can guide the random walk on the network. By running SRW, we expect to re-assign weights to all the edges, such that the random walker from a disease node is more likely to visit the nodes representing the associated genes. However, the training process of supervised random walks (i.e. RWR) can be significantly affected by the hub nodes in the network. To solve this problem, we propose a Laplacian normalization method to weigh the importance of different types of edges in an integrated network described as follows.

Given an integrated network G(V,E), let node v d V represent a kind of disease and let V g V be the set of the candidate genes of v d , then the disease gene prediction problem can be converted to a problem to predict all the new edges between v d and a subset of nodes in V g , where a critical step is to weigh the edges such that a random walker from v d has higher probabilities to reach the known disease genes in V g than to reach the other genes. To provide the training set for learning the edge weight, we generate a positive set V p and a negative set V n for every disease node v d , where V p includes known disease genes associated with v d and V n includes genes not associated with v d .

The approach to weigh the importance of different edge types consists of the following three steps:

  • Laplacian normalization on edge weights. To avoid the biases caused by the hub nodes in the integrated network, we adopt the Laplacian normalization method [53] to normalize all the edge weights. Given a edge (u,v)∈E, the edge weight of edge (u,v) is normalized by all the edges connecting to node u or node v. Mathematically, the laplacian normalized edge weight a(u,v) is defined as:

    $$ a(u,v) = \frac{f(u,v)}{\sqrt{\sum_{i \in N(u)}f(u,i)\sum_{j \in N(v)} f(v,j)}} $$
    (1)

    where N(x) is the set of neighbors of node x; f(x,y)=1/(1+e w·t(x,y)); w is the edge type importance vector for graph G that we will learn in the next step using an optimization process, and its length is equal to the number of possible edge types (in our case, seven); t(x,y) is the vector of the initial weight of edge <x,y>, which has the same length as w. t(x,y) is all zero except one cell because each edge can have one and only one kind of edge type. Note that the edge type is decided by the type of nodes connected by it. For example, gene - gene and HPO term - gene are two different types of edges in the integrated network. a(u,v) integrates and normalizes both the edge type importance w and the initial edge weight t; it can be used to model the random walk transition probability.

  • Edge weight optimization - problem formation. In order to learn the optimal w for all the seven edge types in an integrated network, we minimize an optimal function defined in Eq. 2, such that the random walker in the network is more likely to reach the genes in V p than the genes in V n .

    $$ {{} {\begin{aligned} w&=\arg\min_{w} O(w) \\&\quad\,=\,\ \arg\min_{w}\left(\frac{1}{2}{||w||}^{2} \,+\, \lambda\sum\limits_{v_{d} \in D}\sum\limits_{v_{p}\in V_{p}, v_{n} \in V_{n}} h\left(S_{v_{n}}\,-\,S_{v_{p}}\right)\right) \end{aligned}}} $$
    (2)

    where ||w|| is the euclidean norm; and D is a set of starting nodes representing the diseases in the training set. For each disease node v d D, V p and V n representing the positive training set and the negative training set respectively. \(S_{v_{p}}\) (\(S_{v_{n}}\)) is the association value between v d and v p V p (v d and v n V n ), which can be calculated by running RWR on G [54]. λ is the weight penalty score deciding to what extent the constraints can be violated. Given the value of \(S_{v_{n}}-S_{v_{p}}\), h() is a loss function that returns a non-negative value:

    $$ h(x)= \left\{ \begin{aligned} 0, ~~~~~~ x < 0 \\ \frac{1}{1+ e^{-\frac{x}{b}}}, ~~~~~~x \geq 0\\ \end{aligned} \right. $$
    (3)

    where b is a constant positive parameter, \(x=S_{v_{n}}-S_{v_{p}}\). The smaller the b is, the more sensitive the loss function is (see Additional file 3). If \(S_{v_{n}}-S_{v_{p}}<0\), the association between a disease and a gene in the positive training set is stronger than the association between the same disease and a gene in the negative training set, so h()=0. Otherwise, the constraint is violated, so h()>0.

  • Edge weight optimization - our solution. To optimize edge type importance parameter w to minimize Eq. 2, we adopt a widely used meta-heuristics method called the gradient based optimization method [20], which has been successfully adopted to solve the link prediction problem in social networks and collaboration networks [52].

    To make the story complete, we briefly describe the gradient-based optimization method in the following text.

    First, we construct a stochastic transition matrix Quv′ of RWR using Eq. 1.

    $$ Q_{uv}'= \left\{ \begin{aligned} \frac{a(u,v)}{\sum_{w}a(u,v)}, ~~~~~~if(u,v) \in E \\ 0, ~~~~~~otherwise\\ \end{aligned} \right. $$
    (4)

    And then, based on the transition matrix Quv′, RWR can be described as:

    $$ Q_{uv} = (1-\alpha)Q_{uv}' + \alpha\mathbf{1}(v=s) $$
    (5)

    where u and v represent two arbitrary nodes in G; α is the restart probability, which is a user given threshold (in this case, we find the best value based on the training data set); and node s is a disease node, which is the starting node of random walk. Let \(p_{i}^{(k)}\) be the probability to reach node i from s after k iterations. The probability vector at the kth iteration can be represented as \(P^{(k)} = (p_{1}^{(k)}, p_{2}^{(k)},..., p_{|V|}^{(k)})^{T}\). The stationary probability vector P, which can be obtained after certain iterations, is the solution of the following equation:

    $$ P^{T} = P^{T}Q $$
    (6)

    The next step is to apply a gradient based method to identify w to minimize O(w) in Eq. 2. The derivative of O(w) can be calculated as follows.

    $$ \begin{aligned} \frac{\partial O(w)}{\partial w} &= 2w + \sum\limits_{v_{n},v_{p}}\frac{\partial h(S_{v_{n}}-S_{v_{p}})}{\partial w} \\&= 2w + \sum\limits_{v_{n},v_{p}}\frac{\partial h(S_{v_{n}}-S_{v_{p}})}{\partial (S_{v_{n}}-S_{v_{p}})}\left(\frac{\partial S_{v_{n}}}{\partial w} - \frac{\partial S_{v_{p}}}{\partial w}\right) \end{aligned} $$
    (7)

    \(\frac {\partial S_{v_{x}}}{\partial w}\) can be calculated as follows:

    $$ \frac{\partial S_{v_{x}}}{\partial w} = \sum\limits_{v_{i}}Q_{v_{i}v_{x}}\frac{\partial S_{v_{i}}}{\partial w}+S_{v_{i}}\frac{\partial Q_{v_{i}v_{x}}}{\partial w} $$
    (8)

    This derivative can be repeatedly computed based on the estimate obtained in the previous iteration. The iteration stops when \(\frac {\partial S_{v_{i}}}{\partial w}\) and \(S_{v_{i}}\) do not change. The initial value of \(\frac {\partial S_{v_{i}}}{\partial w}\) is 0. The \(S_{v_{i}}\) is initialized as \(\frac {1}{|V|}\). The initialization process is the same as the traditional SRW method. \(\frac {\partial Q_{v_{i}v_{x}}}{\partial w}\) can be calculated as follows.

    Particularly, \(\frac {\partial Q_{v_{i}v_{x}}}{\partial w} = 0\), if edge (v i ,v x ) does not exist in the network.

    $$ \begin{aligned} \frac{\partial Q_{v_{i}v_{x}}}{\partial w} = (1-\alpha)\frac{\frac{\partial a(v_{i},v_{x})}{\partial w}\left(\sum_{v_{j}}a(v_{i},v_{j})\right) - a(v_{i},v_{x})\sum_{v_{j}}\frac{\partial a\left(v_{i},v_{j}\right)}{\partial w}} {\left(\sum_{k}a(v_{i},v_{j})\right)^{2}} \end{aligned} $$
    (9)
    $$ \frac{\partial a(v_{i},v_{x})}{\partial w} = \frac{\frac{\partial f(v_{i},v_{x})}{\partial w}\pi(f(v_{i},v_{x})) - f(v_{i},v_{x})\frac{\partial \pi(f(v_{i},v_{x}))}{\partial w}} {\pi(f(v_{i},v_{x}))^{2}} $$
    (10)

    where π(f(v i ,v x )) and \(\frac {\partial \pi (f(v_{i},v_{x}))}{\partial w}\) are:

    $$ \pi(f(v_{i},v_{x})) = \sqrt{\sum_{v_{j} \in N(v_{i})}f(v_{i},v_{j})\sum_{v_{y} \in N(v_{x})}f(v_{x},v_{y})} $$
    (11)
    $$ \begin{aligned} \frac{\partial \pi(f(v_{i},v_{x}))}{\partial w} \,=\, \frac{\sum_{v_{j}\in N(v_{i})}\sum_{v_{y} \in N(v_{x})}\left(\frac{\partial f(v_{j},v_{i})}{\partial w}f\left(v_{y},v_{x}\right) \,+\, \frac{\partial f(v_{y},v_{x})}{\partial w}f(v_{j},v_{i})\right)} {2\sqrt{\sum_{v_{j} \in N(v_{i})}f(v_{j},v_{i})\sum_{v_{y}\in N(v_{x})}f(v_{y},v_{x})}} \end{aligned} $$
    (12)

    where N(v) is the set of neighbors of node v. After we get the solution of Eq. 7, we can apply a gradient descent based method and minimize O(w).

Practically, the process of obtaining w has four steps (Fig. 3). Firstly, we initial the O(w) based on the initial parameters. Secondly, the derivative \(\frac {\partial O(w)}{\partial w}\) is calculated in step 2. The power iteration is applied to calculate \(\frac {\partial S_{v_{i}}}{\partial w}\) and \(\frac {\partial Q_{v_{i}v_{x}}}{\partial w}\) respectively. Thirdly, based on the derivative, we can update the gradient to obtain an updated parameter w. Fourthly, taking the updated w as input, step 4 calculates the stationary probability of the RWR. In the process, the iteration for derivative calculation (step 2 in Fig. 3) and the RWR algorithm (step 4 in Fig. 3) are the two key steps. After estimating the edge weight of the integrated network, we can directly apply RWR on the weighted integrated network to predict the relation between diseases and genes.

Fig. 3
figure 3

The process of training the the parameter w. The steps of training the the parameter w

Results and discussion

We compare SLN-SRW with SRW and RWR, where the latter has been widely used in network-based disease gene prediction, on a real and a synthetic data sets. SLN-SRW was implemented with Java 7 in Linux.

Data preparation

As shown in Table 1, eleven data sources, i.e. STRING [49], CTD [55], OMIM [56], ClinVar [50], HGNC [57], MeSH [58], UMLS [59], SIDD [51], DO [60], HPO [61] and GO [62], are used to construct the integrated network G, which has 78,786 nodes and 504,517 edges.

To test the performance of SLN-SRW, we select 430 disease-gene edges from the integrated network as the positive set. The rules for data selection are similar to the rules used in [42]. In the positive set, there are 16 diseases, each of which has at least five known disease-associated genes in the integrated network. More detail about the positive set is listed in Additional file 4. The disease-gene pairs included in the negative set are generated in two steps. First, we select a disease d from the positive set. Second, we repeatedly and randomly select genes that do not connect to d in the integrated network G. The number of the randomly selected genes is the same as the number of genes that connect to d in the positive set. We repeat the process until all disease nodes in the positive set are elected. Note that the positive set is removed from the integrated network in the testing process. Both positive and negative sets are evenly divided into two parts randomly, one for training and the other for testing.

A synthetic data set is generated following the rules in [52]. Specifically, we generated a scale-free network with 1,000 nodes using the Copying model [63] The generation process starts with three connected nodes. We connect a new node u to any of the existing nodes, which are selected at random with probability 0.8 or with probability proportional to the node degree. Parameter b is equal to 0.03 in all the experiments. For each edge in the network, we set w={1,−1} as the gold standard labeled as w . Then, we randomly choose one of the original three nodes as the start point v. Based on the edge strength determined by w , we run RWR starting from v and ranked the other nodes via the stationary probability. We select the top 20 nodes that directly connect with v as the positive training set, and select the nodes that do not connect with v are the negative set. Note that both the positive set and the negative set are removed from the integrated network in the testing process. In the subsection “Performance evaluation on synthetic data set”, we test whether w can be estimated precisely.

Disease gene prediction

The parameters in SLN-SRW and SRW method are estimated based on the training set. The RWR method does not need the training set for edge weight assignment. Alternatively, the training set is used to estimate the best restart probability in RWR. Finally, the performance of all the three methods is tested based on the testing set.

Varying the restart probability α from 0.1 to 0.9, the AUC (area under receiver operating characteristic curve) scores [64] of all the three methods are shown in Fig. 4. If α=0.2, SLN-SRW method reaches the highest AUC score 0.81, whereas SRW and RWR have the highest AUC scores if α=0.6, indicating that SLN-SRW can find the disease genes which are far from the disease node. Based on the edge weights learned using the training data, we predicted the disease-gene relationships in the testing set. We compared the performance of all the three methods using the receiver operating characteristic (ROC) curve. In our test, the AUC score of SLN-SRW (0.79) is the highest (see Fig. 5). Especially, the true positive rate of SLN-SRW is significantly higher than RWR and SRW while its false positive rate keeps low. This is important for disease gene predict, since researchers usually select candidate disease genes with a stringent threshold, which corresponds to a low false positive rate.

Fig. 4
figure 4

he AUC score for each given restart probability for three methods. The AUC score for each given restart probability for three methods. The red, blue and yellow lines are represent SLN-SRW, SRW and RWR method respectively

Fig. 5
figure 5

ROC curves for the experimental results on testing set. ROC curves for the experimental results on testing set. ROC curves for the experimental results calculated with SLN-SRW (green), SRW (red) and RWR (blue)

Finally, we ranked the predicted disease genes to check whether the true disease-related genes have higher ranks than the other genes. Figure 6 shows that the prediction result of SLN-SRW contains more known disease-related genes than SRW and RWR at a majority of the top k levels, indicating that the edge weighing process in SLN-SRW has contributed significantly to the high recall of the results.

Fig. 6
figure 6

True disease-gene pair rates. True disease-gene pair rates at different top k levels

Performance evaluation on synthetic data set

To compare SLN-SRW with SRW, we ran both methods on synthetic data, following the method described in below [52]. For SRW and SLN-SRW, we estimated the edge-type parameter based on the synthetic network and the training set described in the “ Data preparation ” subsection, resulting in w . We measure the performance of SRW and SLN-SRW by comparing the true edge-type parameter w with w , using \(error = \sum _{i}|w'_{i} - w^{*}_{i}|\). After repeating the experiment 100 times, we find that the error of SLN-SRW is statistically significantly lower than that of SRW (t-test pvalue<0.05) indicating that SLN-SRW performs better than SRW (see Fig. 7). The error of SLN-SRW is also lower in the first and third quartile.

Fig. 7
figure 7

The boxplot of the error score. The boxplot of the error score for SLN-SRW and SRW

Conclusions

Identifying the relationships between diseases and genes is vital for disease diagnosis and drug design. Recently, researchers have started to employ integrated biomedical networks to extend the scope and ability for disease gene prediction. In this article, we proposed a novel network-based method named SLN-SRW to define the weight of edges in an integrated network and then use it to predict the gene-disease relationships. SLN-SRW has the following advantages: 1) it can estimate edge weight by differentiating different edge-types; 2) it adopts a Laplacian normalization based method to avoid the bias caused by the super hub nodes in an integrated network; 3) three widely used biomedical ontologies are used to construct a new heterogeneous integrated network. To demonstrate the advantages of SLN-SRW, we compare it with two existing methods SRW and RWR. The experiment on a real data set shows that SLN-SRW performs best among all the three methods. Furthermore, the experiment on a synthetic data set indicates that the edge weights predicted by SLN-SRW are more precise than SRW. Comparing with the existing methods, SLN-SRW has the unique function to identify disease genes, which are not close to any disease node in the disease-gene networks. This could benefit clinicians on discovering new disease-associated genes that have not been identified by the existing methods. Besides, SLN-SRW provides a novel approach to automatically assign weights to the heterogeneous edge types in the disease-gene networks, whereas the existing methods can only define the edge weights manually.

In the future, SLN-SRW will be applied to networks with different edge densities and qualities to test its robustness. Furthermore, we will apply SLN-SRW on more recent datasets and examine the results using both biological experiments and literature.