A Bayesian Nonparametric Model for Integrative Clustering of Omics Data

Peneva, Iliana; Savage, Richard S.

doi:10.1007/978-3-030-30611-3_11

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 296)

Included in the following conference series:

International Conference on Bayesian Statistics in Action

Abstract

Cancer is a complex disease, driven by a range of genetic and environmental factors. Many integrative clustering methods aim to provide insight into the mechanisms underlying cancer, but few are both computationally efficient and able to estimate the number of subtypes. We have developed a Bayesian nonparametric model for combined data integration and clustering called BayesCluster, which aims to identify cancer subtypes and addresses many of the issues faced by existing integrative methods. The proposed method can integrate information from multiple datasets, and offers better cluster interpretability by using nonlocal priors. We incorporate feature learning because of the large number of predictors, and use a Dirichlet process mixture model approach to produce the patient subgroups. We ensure tractable inference with simulated annealing. We apply the model to datasets from The Cancer Genome Atlas project on glioblastoma multiforme, which contain clinical and biological data about cancer patients with an extremely poor survival prognosis. By combining all available information, we are better able to identify clinically meaningful subtypes of glioblastoma.
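
As a rough illustration of why a Dirichlet process mixture does not need the number of subgroups fixed in advance, the following minimal sketch draws cluster labels from a Chinese restaurant process, the partition prior implied by a Dirichlet process. It is not the BayesCluster implementation: the function name `sample_crp_assignments`, the concentration parameter `alpha`, and the seed handling are assumptions made for this example, and it covers only the prior over partitions, with no likelihood, feature learning, nonlocal priors, or simulated annealing.

```python
import random


def sample_crp_assignments(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values make
    opening a new cluster more likely. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items already assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = sample_crp_assignments(n_items=100, alpha=1.0, seed=1)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of occupied clusters is an outcome of the draw rather than an input. In a full mixture model, these prior weights would be multiplied by each cluster's likelihood for the observation being assigned.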

Author information

Authors and Affiliations

University of Warwick, Warwick, UK
Iliana Peneva
Department of Statistics, University of Warwick, Warwick, UK
Richard S. Savage

Corresponding author

Correspondence to Iliana Peneva.

Editor information

Editors and Affiliations

Department of Statistical Sciences, Università Cattolica del Sacro Cuore, Milan, Italy
Raffaele Argiento
Department of Decision Sciences, Bocconi University, Milan, Italy
Daniele Durante
School of Mathematics, University of Edinburgh, Edinburgh, UK
Sara Wade

About this paper

Cite this paper

Peneva, I., Savage, R.S. (2019). A Bayesian Nonparametric Model for Integrative Clustering of Omics Data. In: Argiento, R., Durante, D., Wade, S. (eds) Bayesian Statistics and New Generations. BAYSM 2018. Springer Proceedings in Mathematics & Statistics, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-30611-3_11

DOI: https://doi.org/10.1007/978-3-030-30611-3_11
Published: 22 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30610-6
Online ISBN: 978-3-030-30611-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
