Patient Centric Data Integration for Improved Diagnosis and Risk Prediction

  • Hanie Samimi
  • Jelena Tešić
  • Anne Hee Hiong Ngu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11721)


A typical biological study analyzes heterogeneous biological databases, e.g., genomics, proteomics, metabolomics, and microarray gene expression. These datasets correlate at the patient level: for example, a decrease in the workload of one group of genes in body cells increases the workload of another group and raises the number of its products. Joint analysis of correlated patient-level data sources improves the final diagnosis. State-of-the-art biological methods, such as differential expression analysis, do not support the integration and analysis of heterogeneous data sources. Recently, researchers in several computational fields have significantly improved classical data integration algorithms so that different data types can be investigated at the same level. Applying these methods to biological data gives more insight into associating diseases with heterogeneous groups of patients. In this paper, we improve upon our previous study and propose combining a data reduction technique with similarity network fusion (SNF) as a scalable mechanism for integrating new biological data types. We demonstrate our approach by analyzing the risk factors of Acute Myeloid Leukemia (AML) patients when multiple data sources are available, and we uncover new correlations between patients and patient survival time.
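The combination described above, data reduction followed by fusion of patient similarity networks, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes PCA as the data reduction technique (the abstract does not name one), uses a simplified SNF-style cross-diffusion written directly in NumPy, and runs on synthetic data. The names `view_a` and `view_b`, the function names, and the parameters `k`, `mu`, and `iterations` are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project patients (rows of X) onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def affinity(X, k=5, mu=0.5):
    """Patient-by-patient similarity from a locally scaled exponential kernel."""
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Mean distance to the k nearest neighbours (column 0 is the patient itself).
    knn_mean = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + D) / 3.0
    return np.exp(-D ** 2 / (mu * eps))

def full_kernel(W):
    """Row-normalise off-diagonal similarities to sum to 1/2; diagonal is 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Keep each patient's k strongest neighbours, then row-normalise."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def fuse(affinities, k=5, iterations=10):
    """Simplified cross-diffusion over two or more affinity matrices."""
    P = [full_kernel(W) for W in affinities]
    S = [local_kernel(W, k) for W in affinities]
    m = len(P)
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Diffuse the average of the other networks through this one's
            # local neighbourhood structure.
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            Pv = S[v] @ others @ S[v].T
            updated.append(full_kernel((Pv + Pv.T) / 2.0))
        P = updated
    fused = sum(P) / m
    return (fused + fused.T) / 2.0

# Synthetic stand-ins for two data types measured on the same 30 patients.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(30, 200))
view_b = rng.normal(size=(30, 150))

reduced = [pca_reduce(X, 10) for X in (view_a, view_b)]
fused = fuse([affinity(Z) for Z in reduced])
print(fused.shape)
```

In this sketch, reducing each data type before building its network keeps the distance computation cheap as new data types are added. The fused matrix could then be passed to a clustering method to group patients, which is outside the scope of the sketch.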



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Hanie Samimi (1)
  • Jelena Tešić (1)
  • Anne Hee Hiong Ngu (1)
  1. Department of Computer Science, Texas State University, San Marcos, USA
