Abstract
A typical biological study includes analysis of heterogeneous biological databases, e.g., genomics, proteomics, metabolomics, and microarray gene expression. These datasets correlate at the patient-level, e.g., decrease in the workload of a group of genes in body cells increases the work of other group and raises the number of their products. Joint analysis of correlated patient-level data sources improves the final diagnosis. State-of-art biological methods, such as differential expression analysis, do not support heterogeneous data source integration and analysis. Recently, scientists in different computational fields have made significant improvements in classical algorithms for data integration to enable investigation of different data types at the same level. Applying these methods on biological data gives more insight into associating diseases with heterogeneous groups of patients. In this paper, we improve upon our previous study and propose the use of a combination of a data reduction technique and similarity network analysis (SNF) as a scalable mechanism for integrating new biological data types. We demonstrated our approach by analyzing the risk factors of Acute Myeloid Leukemia (AML) patients when multiple data sources are presented and uncover new correlations between patients and patient survival time.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Alyass, A., Turcotte, M., Meyre, D.: From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med. Genomics 8(1), 33 (2015)
Assenov, Y., Müller, F., Lutsik, P., Walter, J., Lengauer, T., Bock, C.: Comprehensive analysis of DNA methylation data with RnBeads. Nat. Methods 11(11), 1138 (2014)
Cunningham, P., Delany, S.J.: k-nearest neighbour classifiers. Multiple Classifier Syst. 34(8), 1–17 (2007)
Dimitrakopoulos, C., et al.: Network-based integration of multi-omics data for prioritizing cancer genes. Bioinformatics 34, 2441–2448 (2018)
Haasdonk, B., Bahlmann, C.: Learning with distance substitution kernels. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 220–227. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28649-3_27
Hu, Y., Shmygelska, A., Tran, D., Eriksson, N., Tung, J.Y., Hinds, D.A.: GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person. Nat. Commun. 7, 10448 (2016)
Huynh-Thu, V.A., Sanguinetti, G.: Gene regulatory network inference: an introductory survey. In: Sanguinetti, G., Huynh-Thu, V.A. (eds.) Gene Regulatory Networks. MMB, vol. 1883, pp. 1–23. Springer, New York (2019). https://doi.org/10.1007/978-1-4939-8882-2_1
National Cancer Institute: TCGA-LAML. https://portal.gdc.cancer.gov/projects/TCGA-LAML. Accessed 30 May 2019
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P.: Summaries of affymetrix genechip probe level data. Nucleic Acids Res. 31(4), e15 (2003)
Jansen, P.R., et al.: Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat. Genet. 51, 394–403 (2019)
Jemal, A., Thomas, A., Murray, T., Thun, M., et al.: Cancer statistics, 2002. Ca-A Cancer J. Clin. 52(1), 23–47 (2002)
Jolliffe, I.: Principal Component Analysis. Springer, New York (2011)
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. Int. J. 1(6), 90–95 (2013)
Marx, V.: Machine learning, practically speaking. Nat. Methods 16, 463–467 (2019)
Meng, C., Zeleznik, O.A., Thallinger, G.G., Kuster, B., Gholami, A.M., Culhane, A.C.: Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings Bioinform. 17(4), 628–641 (2016)
Moarii, M., Papaemmanuil, E.: Classification and risk assessment in AML: integrating cytogenetics and molecular profiling. Hematol. Am. Soc. Hematol. Educ. Program 2017(1), 37–44 (2017)
Pai, S., Bader, G.D.: Patient similarity networks for precision medicine. J. Mol. Biol. 430(18, Part A), 2924–2938 (2018). Theory and Application of Network Biology Toward Precision Medicine
Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140 (2010)
Samimi, H.: Identification of gene sets that predict acute myeloid leukemia prognosis using integrative gene network analysis. Master’s thesis, Texas State University, August 2018. txi:b4789711
Saultz, J.N., Garzon, R.: Acute myeloid leukemia: a concise review. J. Clin. Med. 5(3), 33 (2016)
Schadt, E.E., Linderman, M.D., Sorenson, J., Lee, L., Nolan, G.P.: Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 11(9), 647 (2010)
Serra, A., Fratello, M., Greco, D., Tagliaferri, R.: Data integration in genomics and systems biology. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 1272–1279. IEEE (2016)
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Wang, B., et al.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333 (2014)
Wanga, B., et al.: SNFtool: similarity network fusion, Published 24 April 2018. https://CRAN.R-project.org/package=SNFtool
Wanichthanarak, K., Fahrmann, J.F., Grapov, D.: Genomic, proteomic, and metabolomic data integration strategies. Biomark. Insights 10s4 (2015)
Zitnik, M., Nguyen, F., Wang, B., Leskovec, J., Goldenberg, A., Hoffman, M.M.: Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Inf. Fusion 50, 71–91 (2019)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Samimi, H., Tešić, J., Ngu, A.H.H. (2019). Patient Centric Data Integration for Improved Diagnosis and Risk Prediction. In: Gadepally, V., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH Poly 2019 2019. Lecture Notes in Computer Science(), vol 11721. Springer, Cham. https://doi.org/10.1007/978-3-030-33752-0_13
Download citation
DOI: https://doi.org/10.1007/978-3-030-33752-0_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33751-3
Online ISBN: 978-3-030-33752-0
eBook Packages: Computer ScienceComputer Science (R0)