Abstract
DNA methylation is an important epigenetic modification that plays a vital role in many diseases, including cancers. With the development of high-throughput sequencing technology, much progress has been made in uncovering the relations between DNA methylation and disease. However, analyzing DNA methylation data is challenging because of missing values caused by the limitations of current techniques. Although many methods have been developed to impute the missing values, they are mostly based on correlations between individual samples and are therefore of limited use for the abnormal samples found in cancers. In this study, we present TDimpute-DNAmeth, a novel transfer learning-based neural network for imputing missing DNA methylation data. The method learns common relations between DNA methylation from pan-cancer samples and then fine-tunes the learned relations on each specific cancer type to impute the missing data. Tested on 16 cancer datasets, our method outperformed other commonly used methods. Further analyses indicated that DNA methylation is related to cancer survival and can thus be used as a biomarker of cancer prognosis.
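The pretrain-then-fine-tune idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-hidden-layer network, the masked mean squared error, the random hiding of observed entries during training, the synthetic data, and all names (`init_params`, `train`, `impute`, `simulate`) are hypothetical and do not describe the TDimpute-DNAmeth architecture or its hyperparameters.

```python
import numpy as np

def init_params(n_features, n_hidden, rng):
    # Small random weights for a one-hidden-layer network (hypothetical architecture).
    return {
        "W1": rng.normal(0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_features)),
        "b2": np.zeros(n_features),
    }

def forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])
    # Sigmoid output keeps predictions in [0, 1], like methylation beta values.
    out = 1.0 / (1.0 + np.exp(-(h @ params["W2"] + params["b2"])))
    return h, out

def train(params, data, observed, epochs, lr, rng, drop=0.2):
    # Full-batch gradient descent on the mean squared error over observed entries.
    params = {k: v.copy() for k, v in params.items()}
    n_obs = observed.sum()
    for _ in range(epochs):
        # Hide a random subset of observed entries so the network must reconstruct them.
        keep = observed & (rng.random(data.shape) > drop)
        x_in = np.where(keep, data, 0.0)
        h, out = forward(params, x_in)
        err = np.where(observed, out - data, 0.0)
        d_out = 2.0 * err / n_obs * out * (1.0 - out)
        d_h = (d_out @ params["W2"].T) * (1.0 - h ** 2)
        grads = {
            "W1": x_in.T @ d_h, "b1": d_h.sum(axis=0),
            "W2": h.T @ d_out, "b2": d_out.sum(axis=0),
        }
        for k in params:
            params[k] -= lr * grads[k]
    return params

def impute(params, data, observed):
    # Keep observed values; fill only the missing entries with predictions.
    _, out = forward(params, np.where(observed, data, 0.0))
    return np.where(observed, data, out)

# Synthetic stand-in data: a shared low-rank structure across "cancer types".
rng = np.random.default_rng(0)
loadings = rng.normal(size=(3, 20))

def simulate(n_samples, shift):
    latent = rng.normal(size=(n_samples, 3)) + shift
    return 1.0 / (1.0 + np.exp(-(latent @ loadings)))

pan = simulate(200, 0.0)                      # large pooled pan-cancer set
pan_obs = np.ones(pan.shape, dtype=bool)
target_full = simulate(30, 0.5)               # small set for one cancer type
target_obs = rng.random(target_full.shape) > 0.3
target = np.where(target_obs, target_full, np.nan)

pretrained = train(init_params(20, 8, rng), pan, pan_obs, 300, 0.5, rng)
fine_tuned = train(pretrained, target, target_obs, 100, 0.1, rng)
imputed = impute(fine_tuned, target, target_obs)
```

Fine-tuning here starts from the pretrained weights and uses a smaller learning rate and fewer epochs, which is one common way to adapt a pooled model to a small cohort; whether the paper makes the same choices is not stated in the abstract.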
About this article
Cite this article
Wang, XF., Zhou, X., Rao, JH. et al. Imputing DNA Methylation by Transferred Learning Based Neural Network. J. Comput. Sci. Technol. 37, 320–329 (2022). https://doi.org/10.1007/s11390-021-1174-6
DOI: https://doi.org/10.1007/s11390-021-1174-6