Bayesian nonparametric clustering for large data sets

  • Daiane Aparecida Zuanetti
  • Peter Müller
  • Yitan Zhu
  • Shengjie Yang
  • Yuan Ji


We propose two nonparametric Bayesian methods for clustering big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion, which requires only a single cycle (or a few cycles) of simple deterministic calculations for each observation. The second scheme is an exact method that divides the data into smaller subsamples and determines local partitions that can be computed in parallel. In a second step, it requires only the sufficient statistics of each local cluster to derive global clusters. On simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”
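The single-pass character of the first method can be illustrated with a minimal sketch of Newton's predictive recursion for a mixture of normal kernels, estimating the mixing density on a fixed grid. The grid, the kernel variance, and the weight sequence w_i = 1/(i + 2) are illustrative choices, not the settings used in the paper:

```python
import numpy as np

def predictive_recursion(y, theta_grid, sigma=1.0):
    """One deterministic pass of a predictive-recursion update for the
    mixing density g(theta) under the kernel y ~ N(theta, sigma^2).
    Each observation triggers a single convex update of g on the grid."""
    d = theta_grid[1] - theta_grid[0]            # grid spacing
    g = np.ones_like(theta_grid)
    g /= g.sum() * d                             # flat initial guess, normalised on the grid
    for i, yi in enumerate(y):
        w = 1.0 / (i + 2)                        # decreasing weight sequence
        lik = np.exp(-0.5 * ((yi - theta_grid) / sigma) ** 2)  # normal kernel, up to a constant
        post = lik * g
        post /= post.sum() * d                   # normalised posterior-like term on the grid
        g = (1 - w) * g + w * post               # predictive-recursion update
    return g
```

Because each observation is touched once with O(grid size) arithmetic and no sampling, a pass over n observations is linear in n, which is what makes the scheme attractive for big data.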

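The second, divide-and-conquer scheme described in the abstract can be caricatured as follows: partition the data into subsamples, cluster each locally (and independently, so in parallel if desired), and carry forward only each local cluster's sufficient statistics to a second clustering pass. The sketch below uses a toy one-dimensional k-means as a stand-in for both partition steps and keeps only cluster sizes and means; the function names and the fixed `k_local`/`k_global` are hypothetical, and the paper's actual method is nonparametric and exact rather than k-means-based:

```python
import numpy as np

def _kmeans_1d(x, k, iters=25):
    """Tiny 1-D k-means, a stand-in for the local/global partition steps."""
    centers = x[np.linspace(0, len(x) - 1, k, dtype=int)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels, centers

def local_then_global(chunks, k_local, k_global):
    # Step 1: local partitions, one per subsample; retain only each
    # local cluster's sufficient statistics (size n_j and mean).
    sizes, means = [], []
    for chunk in chunks:
        labels, _ = _kmeans_1d(chunk, k_local)
        for j in range(k_local):
            members = chunk[labels == j]
            if len(members) > 0:
                sizes.append(len(members))
                means.append(members.mean())
    sizes = np.asarray(sizes, dtype=float)
    means = np.asarray(means)

    # Step 2: derive global clusters from the local summaries alone,
    # then recompute each centre as a size-weighted mean.
    labels, centers = _kmeans_1d(means, k_global)
    for j in range(k_global):
        m = labels == j
        if np.any(m):
            centers[j] = np.average(means[m], weights=sizes[m])
    return np.sort(centers)
```

The point of the two-step design is that step 2 never revisits the raw observations: the global pass sees only one (size, mean) pair per local cluster, so its cost depends on the number of local clusters rather than on n.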

Keywords: Big data clustering · Gene–gene interactions · Predictive recursion · Nonparametric Bayes · TCGA



D. A. Zuanetti was supported by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), Brazil. P. Müller and Y. Ji were supported in part by NIH/NCI grant CA 132891-07.

Supplementary material

  • Supplementary material 1: 11222_2018_9803_MOESM1_ESM.r (R, 10 kb)
  • Supplementary material 2: 11222_2018_9803_MOESM2_ESM.r (R, 17 kb)
  • Supplementary material 3: 11222_2018_9803_MOESM3_ESM.pdf (pdf, 120 kb)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Departamento de Estatística, Universidade Federal de São Carlos, São Carlos, Brazil
  2. Department of Mathematics, UT Austin, Austin, USA
  3. NorthShore University HealthSystem, Evanston, USA
  4. NorthShore University HealthSystem, Evanston, and University of Chicago, Evanston, USA
