A Bayesian Nonparametric Model for Integrative Clustering of Omics Data

  • Iliana Peneva
  • Richard S. Savage
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 296)


Cancer is a complex disease, driven by a range of genetic and environmental factors. Many integrative clustering methods aim to provide insight into the mechanisms underlying cancer, but few of them are both computationally efficient and able to estimate the number of subtypes. We have developed a Bayesian nonparametric model for combined data integration and clustering called BayesCluster, which aims to identify cancer subtypes and addresses many of the issues faced by existing integrative methods. The proposed method can integrate and use the information from multiple different datasets, and offers better cluster interpretability by using nonlocal priors. We incorporate feature learning because of the large number of predictors, and use a Dirichlet process mixture model approach to produce the patient subgroups. We ensure tractable inference with simulated annealing. We apply the model to glioblastoma multiforme datasets from The Cancer Genome Atlas project, which contain clinical and biological data about cancer patients with extremely poor survival prognosis. By combining all available information we are able to better identify clinically meaningful subtypes of glioblastoma.
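As a rough illustration of why a Dirichlet process mixture does not need the number of subtypes to be fixed in advance, the sketch below draws a partition from the Chinese restaurant process, the prior over clusterings implied by a Dirichlet process. It is not the BayesCluster implementation: the function name `sample_crp_partition`, the concentration value `alpha`, and the seed are assumptions made purely for illustration, and no data likelihood, feature learning, or simulated annealing is included.

```python
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration only; alpha > 0 is the
    concentration parameter of the Dirichlet process.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels, counts


labels, counts = sample_crp_partition(n_items=100, alpha=1.0)
print(len(counts), "clusters with sizes", counts)
```

The number of clusters is an outcome of the draw rather than an input. In a full mixture model these prior weights would be multiplied by each cluster's likelihood for the patient's data before a label is chosen.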


Bayesian nonparametrics · Data integration · Glioblastoma · Mixture models · Non-local priors



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Warwick, Warwick, UK
  2. Department of Statistics, University of Warwick, Warwick, UK
