A novel algorithm based on bi-random walks to identify disease-related lncRNAs
There is evidence to suggest that lncRNAs are associated with distinct and diverse biological processes. The dysfunction or mutation of lncRNAs are implicated in a wide range of diseases. An accurate computational model can benefit the diagnosis of diseases and help us to gain a better understanding of the molecular mechanism. Although many related algorithms have been proposed, there is still much room to improve the accuracy of the algorithm.
We developed a novel algorithm, BiWalkLDA, to predict disease-related lncRNAs in three real datasets, which have 528 lncRNAs, 545 diseases and 1216 interactions in total. To compare performance with other algorithms, the leave-one-out validation test was performed for BiWalkLDA and three other existing algorithms, SIMCLDA, LDAP and LRLSLDA. Additional tests were carefully designed to analyze the parameter effects such as α, β, l and r, which could help user to select the best choice of these parameters in their own application. In a case study of prostate cancer, eight out of the top-ten disease-related lncRNAs reported by BiWalkLDA were previously confirmed in literatures.
In this paper, we develop an algorithm, BiWalkLDA, to predict lncRNA-disease association by using bi-random walks. It constructs a lncRNA-disease network by integrating interaction profile and gene ontology information. Solving cold-start problem by using neighbors’ interaction profile information. Then, bi-random walks was applied to three real biological datasets. Results show that our method outperforms other algorithms in predicting lncRNA-disease association in terms of both accuracy and specificity.
KeywordsLncRNA-disease association Bi-random walks Gene ontology Interaction profile
Inductive matrix completion
Leave-one-out cross validation
It suggests that only 1.5% of genes in the human genome were protein-coding genes, which are twice as many as that of worm and fruit fly . However, 74.7% of the human genome is involved in the process of primary transcripts . It implies that non-coding RNAs play major roles in the regulation of gene expression. The presence or absence of some non-coding RNAs could down- or up-regulate a cascade of gene expression, which could be drug targets for medical therapy of a disease. Many researchers put efforts in to the discovery of the long non-coding RNAs function. Recent studies have found strong association between lncRNA and diseases. It shows that many lncRNAs play as some functional roles in diverse biological processes, such as cell proliferation, RNA binding complexes, immune surveillance, neuronal processes, morphogenesis and gametogenesis . Their dysfunction may cause various diseases. For example, HOTAIR would induce androgen-independent (AR) activation, which plays a central role in establishing an oncogenic cascade that drives prostate cancer progression. It is also a causal reason for AR-mediated transcription programs in the absence of androgen . Therefore, the prediction of lncRNA function would give us a new way to understand the regulation mechanism and disease pathology. There is an urgent demand for the development of fast and accurate algorithm to predict lncRNA-disease association.
Many computational tools have recently been developed to predict potential lncRNA-disease association and functional patters in biological networks [5, 6, 7, 8, 9, 10]. Functional patterns in biological networks. These computational methods are majorly in three categories. One of them is based on the idea of matrix factorization. Matrix factorization can be seen as a linear model of latent factors. In these methods, a corresponding latent factor is generated for each lncRNA and disease. Then, it uses a dot product of the latent factors to represent their similarity. The objective function of matrix factorization is to learn the optimal latent factors which can minimize the prediction error. Recently, these methods have been widely used in the prediction of lncRNA-disease relationship. For example, MFLDA reduces the high dimension of heterogeneous data sources into low-rank matrices via matrix tri-factorization, which can help to explore and exploit their intrinsic and shared structure . SIMCLDA translates the lncRNA-disease association prediction problem into a recommendation, which can be solved with inductive matrix completion (IMC) . However, matrix factorization may also bear the risk of over-fitting and the problem of costing-time complexity. Another type of methods is based on the idea of "guilt-by-associate". They are intuitively guided by the assumption that similar disease or lncRNA have similar connection patterns. If disease (A) and lncRNA (A) are known to be related, and disease (A) and disease (B) are very similar. We can infer disease (B) may also related to lncRNA (A). Obviously, the performance of these algorithms heavily depends on the accuracy of the similarity measures. Many "guilt-by-association" algorithms have been proposed. For example, RWRlncD infers potential human lncRNA-disease associations by implementing the random walk with restart method on a lncRNA functional similarity network . IRWRLDA predicts novel lncRNA-disease associations by integrating known lncRNA-disease associations, disease semantic similarity, and various lncRNA similarity measures and make prediction based on improved Random Walk with Restart . The third type of methods focus on classification. Feature extraction was performed on the complex network. Binary classifiers could be applied in the following step to predict whether there exists a connection between lncRNAs and diseases. Another typical prediction algorithm is LRLSLDA, which constructs a cost function in lncRNA and disease space and makes prediction by combining several classifiers in the lncRNA and disease space into a single classifier . LDAP predicts potential lncRNA-disease associations by using a bagging SVM classifier based on lncRNA similarity and disease similarity .
In this paper, we proposed a novel algorithm, BiWalkLDA, to predict potential lncRNA-disease associations. The design of BiwalkLDA was intuitivly guided by the assumption of "guilt-by-associate". In order to construct more accurate similarity network, we integrate two types of data from interaction profiles and gene ontology. Furthermore, our method was designed to solve the cold-start problem. BiWalkLDA uses bi-random walks algorithm to predict lncRNA-disease association base on a similarity network we constructed. The experiments were carried out on three real datasets downloaded from the LncRNADisease database . Algorithm performance were evaluated by using Leave-one-out cross validation (LOOCV). Results show that BiWalkLDA outperforms other four state-of-art algorithms, meanwhile it is robust on different datasets and parameters in predicting novel lncRNA-disease associations.
Construction of disease similarity networks
Here α is a hyperparameter that control the proportion of SGKD and SGO. If α=1, disease similarity only be calculated base on gene ontology information. If α=0, disease similarity only be calculated base on known disease-lncRNA associations. When the matrix is sparse, it would be better to give a large α so that similarity rewards can be obtained from geneontology. This technique makes the algorithm more robust
Construction of lncRNA similarity network
Calculation of interaction profiles for new lncRNAs
The algorithm of Bi-random walk
Data and materials
Detailed information for three datasets
No. of lncRNA
No. of disease
No. of interaction
We use leave-one-out cross validation (LOOCV) to test the performance of BiwalkLDA. LOOCV is a widely-used strategy to evaluate the quality of the algorithms. In each turn, one known association was set as a test sample. All other lncRNA-disease association were set to training set to train model. All associations that are not observed will be considered as a candidate set and will be scored by BiwalkLDA. A correspond rankList can be generated based on the predicted results. Then true positive rates (TPR, sensitivity) and false positive rates (FPR, 1-specificity) can be calculated by giving different thresholds. Based on the calculated values of TPR and FPR, the receiver-operating characteristics (ROC) curves can be plotted. Then we use the areas under ROC curve (AUC) as evaluation criteria of algorithmic performance which reflects the global prediction accuracy in different situation. The value of AUC closed to one means a perfect prediction, while the AUC value of 0.5 indicates purely random performance.
The effects of parameters
The effects of α
The effects of β
The effects of l and r
The effects of parameters l and r in dataset1
r = 1
r = 2
r = 3
r = 5
r = 6
l = 1
l = 2
l = 3
l = 4
l = 5
l = 6
l = 7
Comparison with other algorithms
De novo lncRNA-disease prediction
Top ten reported lncRNAs for prostate cancer
Name of lncRNA
Many recent studies suggest that lncRNAs are strongly associated with various complex human diseases and they play important roles in the gene expression regulation and post-transcription modification. Predicting lncRNA-disease association can help understand the biological mechanism of disease and reduce the cost of experimental verification. However, discovering the relationship between lncRNA and disease by means of computational model is still a very challenging problem. Therefore, the development of computational tools is much in demand. Although many computational models have been proposed. Their prediction accuracy still has a lot of room to improve. To improve the performance of existing algorithms, we present a novel algorithm, BiwalkLDA based on bi-random walks for the prediction of lncRNA-disease associations. It integrates gene ontology and interaction profile data together to calculate disease similarity, to solve the cold-start problem by using the local structure of lncRNAs neighbors information. Four the-state-of-art computational methods and BiwalkLDA are applied to predict lncRNA-disease associations on three different datasets. Results show that BiwalkLDA is superior to every other existing algorithms in terms of both accuracy and recall. There are still many problems to be dealt with. Existing models are based on small-scale datasets. Although algorithms can achieve high accuracy, their results are often repetitive. If the dataset is too large, the existing algorithms can not be applied to large-scale data. In future work, we will consider to develop more effective algorithm to solve this problem.
Many thanks go to Dr. Bolin Chen and Dr. Jiajie Peng for discussion.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 18, 2019: Selected articles from the Biological Ontologies and Knowledge bases workshop 2018. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-18.
JH designed the computational framework, YG, JL, YZ, and JW performed all the analyses of the data and wrote the manuscript; XS is the major coordinator, who contributed a lot of time and efforts in the discussion of this project. All authors read and approved the final manuscript.
Publication costs were funded by the National Natural Science Foundation of China (Grant No. 61702420); This project has also been funded by the National Natural Science Foundation of China (Grant No. 61332014, 61702420 and 61772426); the China Postdoctoral Science Foundation (Grant No. 2017M613203); the Natural Science Foundation of Shaanxi Province (Grant No. 2017JQ6037); the Fundamental Research Funds for the Central Universities (Grant No. 3102018zy032); the Top International University Visiting Program for Outstanding Young Scholars of Northwestern Polytechnical University.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 1.Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al.Initial sequencing and analysis of the human genome. Nature. 2001; 3(6822):346.Google Scholar
- 9.Hu J, Wang J, Li J, Lin J, Liu T, Zhong Y, Liu J, Zheng Y, Gao Y, He J, Shang X. MD-SVM: A novel SVM-based algorithm for the motif discovery of transcription factor binding sites. BMC Bioinformatics. 2019; 20(S7). https://doi.org/10.1186/s12859-019-2735-3.
- 10.Peng J, Guan J, Shang X. Predicting Parkinson’s disease genes based on node2vec and autoencoder. Front Genet. 2019; 10. https://doi.org/10.3389/fgene.2019.00226.
- 21.Chen X, Huang YA, You ZH, Yan GY, Wang XS. A novel approach based on katz measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics. 2016; 33(5):733–9.Google Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.