A Bayesian Nonparametric Model for Integrative Clustering of Omics Data
Abstract
Cancer is a complex disease, driven by a range of genetic and environmental factors. Many integrative clustering methods aim to provide insight into the mechanisms underlying cancer, but few are both computationally efficient and able to estimate the number of subtypes. We have developed a Bayesian nonparametric model for combined data integration and clustering called BayesCluster, which aims to identify cancer subtypes and addresses many of the issues faced by existing integrative methods. The proposed method can integrate information from multiple datasets and offers better cluster interpretability by using nonlocal priors. We incorporate feature learning to handle the large number of predictors, and use a Dirichlet process mixture model approach to produce the patient subgroups. We ensure tractable inference with simulated annealing. We apply the model to glioblastoma multiforme datasets from The Cancer Genome Atlas project, which contain clinical and biological data on cancer patients with an extremely poor survival prognosis. By combining all available information, we are able to better identify clinically meaningful subtypes of glioblastoma.
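As an illustration of why a Dirichlet process prior lets the number of subgroups be estimated rather than fixed in advance, the following minimal sketch draws a partition from the Chinese restaurant process, the clustering prior induced by a Dirichlet process. It is not the BayesCluster implementation: the function name `sample_crp_partition` and the parameters `alpha` and `seed` are assumptions made for this example, and no likelihood, feature learning, or simulated annealing is included.

```python
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values tend to
    produce more clusters. Names here are illustrative only.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


labels, counts = sample_crp_partition(n_items=50, alpha=1.0)
print("number of clusters:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters is an outcome of the draw, not an input. In a full mixture model these prior weights would be multiplied by each cluster's likelihood for the patient's data before sampling an assignment.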
Keywords
Bayesian nonparametrics, Data integration, Glioblastoma, Mixture models, Non-local priors