Integrative clustering methods of multi-omics data for molecule-based cancer classifications

Abstract

One goal of precision oncology is to reclassify cancers by their molecular features rather than by tissue of origin. Integrative clustering of large-scale multi-omics data is an important approach to molecule-based cancer classification. Data heterogeneity and the complexity of inter-omics variation are the two major challenges for integrative clustering analysis. According to the strategies used to address these challenges, we group the clustering methods into three major categories: direct integrative clustering, clustering of clusters, and regulatory integrative clustering. We also discuss practical considerations for data pre-processing, post-clustering analysis and pathway-based analysis.

Author information

Corresponding author

Correspondence to Jin Gu.

Additional information

This article is dedicated to the Special Collection of Recent Advances in Next-Generation Bioinformatics (Ed. Xuegong Zhang).

About this article

Cite this article

Wang, D., Gu, J. Integrative clustering methods of multi-omics data for molecule-based cancer classifications. Quant Biol 4, 58–67 (2016). https://doi.org/10.1007/s40484-016-0063-4

Keywords

  • clustering
  • cancer classification
  • omics
  • integrative analysis