Abstract
Cluster analysis is the task of grouping a set of objects so that objects in the same cluster are similar to each other. It is widely used in many fields, including machine learning, bioinformatics, and computer graphics. In all of these applications, the partition is an inference goal, along with the number of clusters and their distinguishing characteristics. The mixture of factor analyzers (MFA) model is a special case of model-based clustering that assumes the covariance matrix of each cluster arises from a factor analysis model. It simplifies the Gaussian mixture model by reducing the parameter dimension, conceptually representing the variables as arising from a lower-dimensional subspace in which the clusters are separated. In this paper, we introduce a new reversible-jump Markov chain Monte Carlo (RJMCMC) inferential procedure for the family of constrained MFA models.
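The factor-analysis covariance structure described above can be sketched numerically. The snippet below is a minimal illustration (all dimensions and variable names are chosen for the example, not taken from the paper): each cluster's p-by-p covariance is the low-rank-plus-diagonal matrix ΛΛᵀ + Ψ, which needs only pq + p parameters instead of p(p+1)/2 for an unconstrained covariance.

```python
import numpy as np

# Illustrative sketch of the covariance structure assumed for one MFA
# cluster: Sigma = Lambda Lambda^T + Psi, with Lambda a p x q loading
# matrix (q < p) and Psi a diagonal matrix of uniquenesses.
rng = np.random.default_rng(0)
p, q = 10, 2                                  # observed and latent dimensions

Lambda = rng.normal(size=(p, q))              # factor loadings
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))  # diagonal noise variances
Sigma = Lambda @ Lambda.T + Psi               # low-rank-plus-diagonal covariance

# Sample n observations via q-dimensional latent factors: x = Lambda z + e.
# Only p*q + p covariance parameters are used instead of p*(p+1)/2.
n = 500
z = rng.normal(size=(n, q))                           # latent factors
e = rng.normal(size=(n, p)) * np.sqrt(np.diag(Psi))   # idiosyncratic noise
x = z @ Lambda.T + e

emp = np.cov(x, rowvar=False)
print(np.abs(emp - Sigma).max())  # sample covariance approximates Sigma
```

The key point is the parameter count: here pq + p = 30 covariance parameters per cluster rather than p(p+1)/2 = 55, and the saving grows quickly with p.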
The three goals of inference here are the partition of the objects, the number of clusters, and the identification and estimation of the covariance structure of the clusters; each therefore has a posterior distribution. RJMCMC is the main sampling tool, as it allows the dimension of the parameter space to be estimated. We present simulations comparing this inferential technique with previous methods in terms of estimation of the clustering parameters and the partition. Finally, we illustrate the new methods with a dataset of DNA methylation measures for subjects with different brain tumor types. Our method uses four latent factors to correctly discover the five brain tumor types without assuming a constant variance structure, and it classifies subjects with excellent accuracy.
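The dimension-jumping idea behind RJMCMC can be shown on a toy problem. The sketch below is an illustrative example only, not the paper's sampler: it jumps between M0 (y ~ N(0, 1), no free parameter) and M1 (y ~ N(θ, 1) with prior θ ~ N(0, 1)). Proposing θ from its prior makes the Jacobian 1 and cancels the prior in the acceptance ratio, leaving just a likelihood ratio when the two models have equal prior probability.

```python
import numpy as np

# Toy reversible-jump MCMC sketch (hypothetical example, not the paper's
# sampler): jump between M0: y ~ N(0,1), no free parameter, and
# M1: y ~ N(theta,1) with prior theta ~ N(0,1). Birth proposes theta
# from its prior, so the Jacobian is 1 and the prior cancels, leaving a
# likelihood-ratio acceptance test (equal model priors assumed).
rng = np.random.default_rng(1)
y = rng.normal(loc=0.8, scale=1.0, size=50)   # data with a nonzero mean

def loglik(theta):
    # log-likelihood up to an additive constant (constants cancel in ratios)
    return -0.5 * np.sum((y - theta) ** 2)

model, theta, visits = 0, 0.0, [0, 0]
for _ in range(5000):
    if model == 0:
        if rng.uniform() < 0.5:               # birth move, chosen w.p. 1/2
            th_new = rng.normal()             # propose theta from its prior
            if np.log(rng.uniform()) < loglik(th_new) - loglik(0.0):
                model, theta = 1, th_new
        # otherwise stay: M0 has no parameter to update within-model
    else:
        if rng.uniform() < 0.5:               # death move, chosen w.p. 1/2
            if np.log(rng.uniform()) < loglik(0.0) - loglik(theta):
                model, theta = 0, 0.0
        else:                                 # within-model random walk
            th_prop = theta + 0.2 * rng.normal()
            a = (loglik(th_prop) - 0.5 * th_prop ** 2
                 - loglik(theta) + 0.5 * theta ** 2)
            if np.log(rng.uniform()) < a:
                theta = th_prop
    visits[model] += 1

# Fraction of iterations in the larger model estimates its posterior
# probability; it should be near 1 here, since the data mean is far from 0.
print(visits[1] / sum(visits))
```

The same bookkeeping, applied with birth/death moves over the number of clusters and the number of factors, is what lets a reversible-jump sampler treat both quantities as unknowns with posterior distributions.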
Cite this article
Lu, X., Li, Y. & Love, T. On Bayesian Analysis of Parsimonious Gaussian Mixture Models. J Classif 38, 576–593 (2021). https://doi.org/10.1007/s00357-021-09391-8