LncDisAP: a computation model for LncRNA-disease association prediction based on multiple biological datasets
Over the past decades, a large number of long non-coding RNAs (lncRNAs) have been identified. Growing evidence has indicated that the mutation and dysregulation of lncRNAs play a critical role in the development of many complex human diseases. Consequently, identifying potential disease-related lncRNAs is an effective means to improve the quality of disease diagnostics and treatment, which is the motivation of this work. Here, we propose a computational model (LncDisAP) for potential disease-related lncRNA identification based on multiple biological datasets. First, the associations between lncRNAs and different data sources are collected from different databases. With these data sources as dimensions, we calculate the functional associations between lncRNAs by the recommendation strategy of collaborative filtering. Subsequently, a disease-associated lncRNA functional network is built with the functional similarities between lncRNAs as the edge weights. Finally, potential disease-related lncRNAs are identified based on ranked scores derived by random walking with restart (RWR). To assess the performance of LncDisAP on 54 diseases, training sets and testing sets are extracted from two different versions of a disease-lncRNA dataset.
A lncRNA functional network is built based on the proposed computational model, and it contains 66,060 associations among 364 lncRNAs associated with 182 diseases in total. We extract 218 known disease-lncRNA pairs associated with 54 diseases to assess the network. As a result, the average AUC (area under the receiver operating characteristic curve) of LncDisAP is 78.08%.
In this article, a computational model integrating multiple lncRNA-related biological datasets is proposed for identifying potential disease-related lncRNAs. The results show that LncDisAP is successful in predicting novel disease-related lncRNA signatures. In addition, with several common cancers taken as case studies, we found some previously unreported lncRNAs that could be associated with these diseases through our network. These results suggest that this method can be helpful in improving the quality of disease diagnostics and treatment.
Keywords: Long non-coding RNAs, Disease lncRNA network, Random walking with restart
Long non-coding RNAs (lncRNAs), which compose the largest portion of the mammalian non-coding transcriptome, are emerging as important regulators of tissue physiology and disease processes. lncRNAs are expressed in a more tissue-specific fashion than mRNA genes and are highly specific to cell type, organ, and species. A large number of lncRNAs have been demonstrated to have a close relationship with many complex human diseases [5, 6, 7, 8]. Therefore, an increasing recognition of the roles of lncRNAs in human disease has created new diagnostic and therapeutic opportunities. The identification of potential lncRNAs related to complex diseases is an active research topic in medicine.
LncRNAs are key to explaining disease mechanisms, and many researchers have studied them to explore complex human diseases at the molecular level. For example, BCYRN1 has been demonstrated to induce the proliferation and migration of non-small cell lung cancer (NSCLC) cells and to play an important role in NSCLC progression. LncRNA SNHG1 regulates NOB1 expression by sponging miR-326 and promotes tumourigenesis in osteosarcoma. Ye et al. found that LINC00460 promotes the progression of lung adenocarcinoma by competitively binding miR-302c-5p and regulating the FOXA1 signalling pathway. Aksoy et al. postulated that the overexpression of lncRNA DANCR may be associated with poor outcomes in upper rectal cancer. LncRNA HOTAIR acts as an oncogenic molecule in different cancers, including breast, gastric, colorectal and cervical cancer. Similarly, lncRNA MALAT1 is considered a potential biomarker for the diagnosis and prediction of cancers and may also serve as a therapeutic target for the treatment of specific tumours. In 2018, Chen et al. deduced that the expression of lncRNA ZEB1-AS1 might be used as a promising prognostic biomarker for cancer. The above studies show that lncRNAs have recently been regarded as possible biomarkers for disease.
Although a large number of lncRNAs have been recorded in public databases, such as GENCODE, NONCODE and LNCipedia, only a few lncRNAs have been characterized functionally. Several methods have been developed to predict potential lncRNA-disease associations [21, 22]. However, they take into account only disease semantic similarity and ignore disease functional similarity. Exploring both the semantic and the functional associations of diseases is beneficial in measuring disease similarity, because not all associations between diseases are represented by the disease ontology, and many of them are reflected through the functional associations among disease-related genes. Moreover, the lack of unified identifiers for lncRNAs leads to an underutilization of information from different public lncRNA databases when lncRNA functional annotations are approached. Therefore, we aimed to identify more disease-related lncRNAs by efficiently analysing the lncRNA and disease data. First, we extracted and utilized functional information related to lncRNAs, including disease similarity, protein-protein interactions and lncRNA-mRNA associations. Subsequently, we established functional associations between lncRNAs and built a disease-related lncRNA network. Potential disease-related lncRNA signatures were then predicted by random walking with restart (RWR).
Materials and methods
The Disease Ontology (DO) database focuses on representing common and rare disease concepts and aims to provide an open source ontology for the integration of biomedical data associated with human disease. Each node in DO represents one disease term. All of these nodes are organized in a directed acyclic graph (DAG) with an ‘IS_A’ relationship. MEDIC, as a part of the Comparative Toxicogenomics Database (CTD), integrates Online Mendelian Inheritance in Man (OMIM) terms, synonyms and identifiers with MeSH terms, synonyms, definitions, identifiers and hierarchical relationships. It is composed of 9700 unique diseases described by more than 67,000 terms. In this study, we map lncRNA-related diseases to DO, utilizing terms and synonyms from DO and MEDIC.
RNAcentral is a database of non-coding RNA (ncRNA) sequences that aggregates data from specialized ncRNA resources. It assigns a unique identifier to every distinct RNA sequence. Because the different lncRNA databases do not share a uniform identifier, we use identifiers from RNAcentral as unified labels for lncRNAs.
Human lncRNA-disease association data
LncRNADisease is a database that curates experimentally supported lncRNA-disease association data. Presently, three versions are available. The 2015 version covered 1102 lncRNA-disease entries, including 373 lncRNAs and 252 diseases, while the 2017 version integrated 2947 lncRNA-disease entries, including 888 lncRNAs and 328 diseases. The newest version was released in 2018 and contains 5714 lncRNAs and 423 diseases. Here, we extract associations between lncRNAs and diseases from this database and use the differences between its versions to validate the reliability of LncDisAP.
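The idea of validating against version differences can be illustrated with a minimal sketch. It assumes that each database version has already been reduced to a set of (lncRNA, disease) pairs with unified identifiers, that the older version serves as the training set, and that pairs appearing only in the newer version serve as the test set. The function name and the toy identifiers are hypothetical, and the exact extraction rules used for LncDisAP may differ.

```python
def split_by_version(old_pairs, new_pairs):
    """Derive training and test sets from two versions of an association list.

    old_pairs, new_pairs: iterables of (lncrna_id, disease_id) tuples.
    Assumption: the older version is the training set and the pairs that
    appear only in the newer version form the test set.
    """
    train = set(old_pairs)
    candidates = set(new_pairs) - train
    # Keep only test pairs whose lncRNA and disease are both known to the
    # training data; otherwise they cannot be scored in the trained network.
    known_lncrnas = {lnc for lnc, _ in train}
    known_diseases = {dis for _, dis in train}
    test = {(lnc, dis) for lnc, dis in candidates
            if lnc in known_lncrnas and dis in known_diseases}
    return train, test


# Toy identifiers only; these are not real database entries.
old_version = [("lncA", "disease1"), ("lncB", "disease1"), ("lncB", "disease2")]
new_version = old_version + [("lncA", "disease2"), ("lncC", "disease1")]
train_set, test_set = split_by_version(old_version, new_version)
```

In this toy case the pair ("lncC", "disease1") is dropped from the test set because lncC has no training associations and therefore would not appear in a network built from the older version.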
Human protein-protein interaction data
STRING is a database of known and predicted protein-protein interactions. These interactions in STRING include direct (physical) interactions, as well as indirect (functional) interactions, which stem from computational prediction, knowledge transfer between organisms, and interactions aggregated from other databases. The STRING database currently covers 9,643,763 proteins from 2031 organisms. Here, protein-protein interactions from STRING are involved in the lncRNA similarity computation.
Human lncRNA interaction data
starBase v2.0 systematically identified RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq data sets generated by 37 independent studies, providing 423,966 miRNA-mRNA, 10,212 miRNA-lncRNA and 17,609 protein-lncRNA experimentally confirmed interactions based on large-scale CLIP-Seq data. The HPRD represents an mRNA-mRNA interaction network for humans. All the information in HPRD has been manually extracted from the literature by expert biologists. Currently, HPRD covers 39,240 mRNA-mRNA interactions among 9465 mRNAs.
LncRNA functional similarity calculation
Differences between the data sets make the integration of lncRNA data difficult. Two problems must be solved before constructing the lncRNA functional association network. One is the mapping of disease terms. MEDIC and DO are both comprehensive disease corpora and contain abundant disease terms, so we can annotate DO entries with the vocabulary from MEDIC and create a combined vocabulary of disease terms. Referring to this vocabulary, we build mappings between the DO terms and the disease terms of LncRNADisease. The other problem is the unification of lncRNA identifiers. As mentioned above, the lncRNA naming rules differ between lncRNA databases. Therefore, we employ the RNAcentral id as the unified identifier for lncRNAs, considering that the RNAcentral database provides mapping data among various public lncRNA databases.
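The two preprocessing steps described above can be sketched as simple dictionary lookups. This is only an illustration under stated assumptions: the input layouts, the normalisation rule, the function names and all identifiers (such as "DOID:TOY1" and "URS_TOY_1") are hypothetical placeholders, not the actual file formats of DO, MEDIC or RNAcentral.

```python
def normalise(term):
    """Lower-case a disease term and collapse hyphens and extra spaces."""
    return " ".join(term.lower().replace("-", " ").split())


def build_vocabulary(do_entries, medic_entries):
    """Combine DO names with MEDIC synonyms into one term -> DO id lookup.

    do_entries: {do_id: [name, synonym, ...]}
    medic_entries: list of synonym groups, each a list of terms.
    A MEDIC group is attached to a DO entry only when its terms match
    exactly one DO id, to avoid ambiguous annotations.
    """
    vocab = {}
    for do_id, names in do_entries.items():
        for name in names:
            vocab[normalise(name)] = do_id
    for group in medic_entries:
        keys = [normalise(term) for term in group]
        hits = {vocab[k] for k in keys if k in vocab}
        if len(hits) == 1:
            do_id = hits.pop()
            for k in keys:
                vocab.setdefault(k, do_id)
    return vocab


def map_diseases(disease_names, vocab):
    """Map free-text disease names to DO ids (None when no match)."""
    return {name: vocab.get(normalise(name)) for name in disease_names}


def unify_lncrna_ids(records, xref):
    """Translate (database, local_id) records to unified ids via a lookup."""
    return sorted({xref[rec] for rec in records if rec in xref})


# Toy data only.
do_entries = {"DOID:TOY1": ["example carcinoma", "example cancer"]}
medic_entries = [["Example Carcinoma", "Example Tumour"]]
vocab = build_vocabulary(do_entries, medic_entries)
mapped = map_diseases(["Example tumour", "unlisted disease"], vocab)

xref = {("dbA", "A001"): "URS_TOY_1", ("dbB", "B9"): "URS_TOY_1"}
unified = unify_lncrna_ids([("dbA", "A001"), ("dbB", "B9"), ("dbB", "B10")], xref)
```

Here two records from different toy databases collapse to one unified identifier, and the record without a cross-reference is dropped.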
LncRNA-related disease similarity
Vector model construction for lncRNAs
LncRNA functional similarity
Identifying novel candidate disease-related lncRNAs
To identify novel candidate disease-related lncRNAs, we employ RWR to fully exploit the global functional associations between lncRNAs in this network. RWR, as a global optimization method, can reveal more information between one lncRNA and all the others in the network. The random walker in the network starts from the root node and moves to adjacent nodes with the probabilities from that node to the others. After enough iterations, the probabilities from the root node to all the other nodes will become stable, which can be used as scores for predicting novel disease-related lncRNAs (see  for RWR details). Finally, rankings for each lncRNA in this network can be listed by RWR.
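The iteration described above can be written down as a short sketch. It assumes the common RWR formulation p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalised similarity matrix, p(0) places equal probability on the known disease-related lncRNAs, and r is the restart probability. The value r = 0.7, the convergence tolerance and the toy matrix are illustrative assumptions and are not taken from the article.

```python
def random_walk_with_restart(weights, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    weights: symmetric n x n list of lists of non-negative similarities.
    seeds: indices of the root nodes (known disease-related lncRNAs).
    restart: probability of jumping back to the seeds (assumed value).
    """
    n = len(weights)
    # Column-normalise so that each column holds transition probabilities.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    trans = [[weights[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
              for j in range(n)] for i in range(n)]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(trans[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:  # probabilities have become stable
            break
    return p


# Toy 4-node network; node 0 is the only seed.
W = [[0.0, 0.9, 0.1, 0.0],
     [0.9, 0.0, 0.5, 0.0],
     [0.1, 0.5, 0.0, 0.2],
     [0.0, 0.0, 0.2, 0.0]]
scores = random_walk_with_restart(W, seeds=[0])
# Rank the non-seed nodes by descending score to obtain candidates.
ranking = sorted((i for i in range(len(W)) if i != 0), key=lambda i: -scores[i])
```

In the toy network, node 1 is most similar to the seed and is ranked first, followed by node 2 and then node 3, which is reachable only through node 2.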
LncRNAs and diseases
We obtained 3,801,586 associations among 4703 disease terms from DO based on disease similarity calculations. Meanwhile, we found 1083 relationships between 184 diseases and 374 lncRNAs by mapping DO terms to the diseases in LncRNADisease (released in July 2017). There were 5,600,133 relationships between 13,716 mRNAs and 1034 lncRNAs extracted from starBase v2.0. We found 15,622 associations between 33 proteins and 2750 lncRNAs from starBase v2.0 and STRING.
We calculated similarity among 374 lncRNAs and removed lncRNA pairs that had a similarity of 0. Finally, we built a lncRNA functional network, which contains 66,060 associations among 364 lncRNAs associated with 182 diseases.
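The filtering step can be sketched as follows, assuming the pairwise similarities are available as a symmetric matrix. The function name and the toy values are hypothetical; the sketch only shows how dropping zero-similarity pairs can also remove lncRNAs that are left without any association.

```python
from itertools import combinations


def build_network(names, sim):
    """Keep lncRNA pairs with non-zero similarity as weighted edges.

    names: list of lncRNA identifiers.
    sim: symmetric matrix (list of lists), sim[i][j] is the similarity.
    Returns the nodes that still have at least one edge and the edge dict.
    """
    edges = {}
    for i, j in combinations(range(len(names)), 2):
        if sim[i][j] > 0:
            edges[(names[i], names[j])] = sim[i][j]
    nodes = sorted({name for pair in edges for name in pair})
    return nodes, edges


# Toy data: lncD has zero similarity to every other lncRNA.
names = ["lncA", "lncB", "lncC", "lncD"]
sim = [[1.0, 0.4, 0.0, 0.0],
       [0.4, 1.0, 0.7, 0.0],
       [0.0, 0.7, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
nodes, edges = build_network(names, sim)
```

For scale, 364 lncRNAs admit at most 364 × 363 / 2 = 66,066 undirected pairs, so a network with 66,060 associations is very nearly complete.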
Information on lncRNAs associated with cholangiocarcinoma
The impact of data sources and test sets
[Table: The test result based on different versions of the data source. Rows: lung benign neoplasm; squamous cell carcinoma; large intestine cancer; non-small cell lung carcinoma; average value of AUC.]
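An AUC value for a single disease can be computed directly from ranked scores. The sketch below uses the rank-based (Mann-Whitney) definition of AUC: the fraction of positive-negative pairs in which the positive lncRNA receives the higher score, with ties counted as one half. It assumes that scores are available for all candidate lncRNAs and that the held-out disease-related lncRNAs are the positives. The names and numbers are toy values, not results from the article.

```python
def auc_from_scores(scores, positives):
    """Rank-based AUC for one disease.

    scores: {lncrna_id: score}, e.g. steady-state RWR probabilities.
    positives: set of lncRNA ids held out as true associations.
    """
    pos = [s for k, s in scores.items() if k in positives]
    neg = [s for k, s in scores.items() if k not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy scores: two positives (a, c) and two negatives (b, d).
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
toy_auc = auc_from_scores(toy_scores, {"a", "c"})
```

In the toy case three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. Averaging such per-disease values over all tested diseases gives a summary figure of the kind reported as the average AUC.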
LncRNA expression similarity
In this article, a computational model for potential disease-related lncRNA identification was proposed based on multiple biological datasets. The results showed that LncDisAP predicts novel disease-related lncRNA signatures with an average AUC value of 78.08% and can help to improve the quality of disease diagnostics and treatment. To further evaluate the performance of our computational model, we used several common cancers as case studies and found some previously unreported lncRNAs that could be associated with these diseases through our network. In addition, we discussed the impact of different data sources and different test sets on the performance of the disease-related lncRNA functional network in predicting disease-lncRNA pairs.
Tianyi Zang and Yadong Wang are the corresponding authors. We thank them for their guidance, and we thank Ling Wang, Rongjie Wang, Yanshuo Chu and Zhenxing Wang for their valuable suggestions on our work.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 16, 2019: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2018: bioinformatics and systems biology. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-16.
WYT and JL performed data collection and preprocessing. With the guidance of ZT and WYD, WYT completed the algorithm design and validation. WYT and PJ were the major contributors in writing the manuscript. All authors have read and approved the final version of the manuscript.
Publication costs were funded by the Major State Research Development Program of China [Grant No.: 2016YFC0901605, 2016YFC1201702–01], the National Natural Science Foundation of China [Grant No.: 61571152, 31601072], and the National High-tech R&D Program of China (863 Program) [Grant No.: 2012AA02A601, 2015AA020101, 2015AA020108].
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 4. Quan N, Carninci P. Expression specificity of disease-associated lncRNAs: toward personalized medicine. Curr Top Microbiol Immunol. 2015;394:237.
- 13. Aksoy F, Aksoy S, Tunca B, Işik O, Ozturk E, Yilmazlar T, Yerci O, Egeli U, Cecener G. The clinical significance of lncRNA DANCR in upper rectal adenocarcinoma. In: Annals of Oncology; 2018.
- 14. Hajjari M, Salavaty A. HOTAIR: an oncogenic long non-coding RNA in different cancers. Cancer Biology & Medicine. 2015;12(1):1–9.
- 15. Zhao M, Wang S, Li Q, Ji Q, Guo P, Liu X. MALAT1: a long non-coding RNA highly associated with human cancers (review). Oncol Lett. 2018.
- 16. Chen C, Feng Y, Wang X. LncRNA ZEB1-AS1 expression in cancer prognosis: review and meta-analysis. Clin Chim Acta. 2018.
- 18. Fang SS, Zhang LL, Guo JC, Niu YW, Wu Y, Li H, Zhao LH, Li XY, Teng XY, Sun XH. NONCODEV5: a comprehensive annotation database for long non-coding RNAs. Nucleic Acids Res. 2017;46(Database issue).
- 26. Kibbe WA, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall CJ, Binder JX, Malone J, Vasant D. Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43(Database issue):1071–8.
- 28. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(Database issue):983–6.
- 29. Library WP. Human protein reference database; 2009.
- 31. Library WE. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–6.
- 33. Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2018.
- 34. Wang Y, Juan L, Chu Y, Wang R, Zang T, Wang Y. FNSemSim: an improved disease similarity method based on network fusion. In: IEEE International Conference on Bioinformatics and Biomedicine; 2017. p. 630–633.
- 35. Tong H, Faloutsos C, Pan J-Y. Fast random walk with restart and its applications. In: ICDM 2006: Sixth International Conference on Data Mining, Proceedings. Edited by Clifton CW, Zhong N, Liu JM, Wah BW, Wu XD; 2006: 613−+.
- 39. Tang J, Li Y, Sang Y, Yu B, Lv D, Zhang W, Feng H. LncRNA PVT1 regulates triple-negative breast cancer through KLF5/beta-catenin signaling. Oncogene. 2018.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.