Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components

Abstract

Recent work on overfitting Bayesian mixtures of distributions offers a powerful framework for clustering multivariate data using a latent Gaussian model which resembles the factor analysis model. The flexibility provided by overfitting mixture models yields a simple and efficient way to estimate the unknown number of clusters and model parameters by Markov chain Monte Carlo sampling. The present study extends this approach by considering a set of eight parameterizations, giving rise to parsimonious representations of the covariance matrix per cluster. A Gibbs sampler combined with a prior parallel tempering scheme is implemented in order to approximately sample from the posterior distribution of the overfitting mixture. The parameterization and number of factors are selected according to the Bayesian information criterion. Identifiability issues related to label switching are dealt with by post-processing the simulated output with the Equivalence Classes Representatives algorithm. The contributed method and software are demonstrated and compared to similar models estimated using the expectation–maximization algorithm on simulated and real datasets. The software is available online at https://CRAN.R-project.org/package=fabMix.

References

  1. Altekar, G., Dwarkadas, S., Huelsenbeck, J.P., Ronquist, F.: Parallel metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20(3), 407–415 (2004). https://doi.org/10.1093/bioinformatics/btg427

  2. Bartholomew, D.J., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach, vol. 904. Wiley, Hoboken (2011)

  3. Bhattacharya, A., Dunson, D.B.: Sparse Bayesian infinite factor models. Biometrika 98(2), 291–306 (2011)

  4. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, Belmont, CA (1984)

  5. Celeux, G., Hurn, M., Robert, C.P.: Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95(451), 957–970 (2000a). https://doi.org/10.1080/01621459.2000.10474285

  6. Celeux, G., Hurn, M., Robert, C.P.: Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95(451), 957–970 (2000b)

  7. Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., Davis, R.W.: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2(1), 65–73 (1998). https://doi.org/10.1016/S1097-2765(00)80114-8

  8. Conti, G., Frühwirth-Schnatter, S., Heckman, J.J., Piatek, R.: Bayesian exploratory factor analysis. J. Econom. 183(1), 31–57 (2014)

  9. Dellaportas, P., Papageorgiou, I.: Multivariate mixtures of normals with unknown number of components. Stat. Comput. 16(1), 57–68 (2006)

  10. Dempster, A.P., Laird, N.M., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–38 (1977)

  11. Eddelbuettel, D., François, R.: Rcpp: seamless R and C++ integration. J. Stat. Softw. 40(8), 1–18 (2011). https://doi.org/10.18637/jss.v040.i08

  12. Eddelbuettel, D., Sanderson, C.: Rcpparmadillo: accelerating R with high-performance C++ linear algebra. Comput. Stat. Data Anal. 71, 1054–1063 (2014). https://doi.org/10.1016/j.csda.2013.02.005

  13. Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)

  14. Fokoué, E., Titterington, D.: Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation. Mach. Learn. 50(1), 73–94 (2003)

  15. Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25(3), 189–201 (1986)

  16. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)

  17. Frühwirth-Schnatter, S., Malsiner-Walli, G.: From here to infinity: sparse finite versus Dirichlet process mixtures in model based clustering. Adv. Data Anal. Classif. 13, 33–64 (2019)

  18. Gaujoux, R.: doRNG: Generic Reproducible Parallel Backend for ‘foreach’ Loops. https://CRAN.R-project.org/package=doRNG, R package version 1.7.1 (2018)

  19. Gelfand, A., Smith, A.: Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85, 398–409 (1990)

  20. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. PAMI–6(6), 721–741 (1984). https://doi.org/10.1109/TPAMI.1984.4767596

  21. Geweke, J., Zhou, G.: Measuring the pricing error of the arbitrage pricing theory. Rev. Financ. Stud. 9(2), 557–587 (1996). https://doi.org/10.1093/rfs/9.2.557

  22. Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Proceedings of the 23rd symposium on the interface, interface foundation, Fairfax Station, Va, pp. 156–163 (1991)

  23. Geyer, C.J., Thompson, E.A.: Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90(431), 909–920 (1995). https://doi.org/10.1080/01621459.1995.10476590

  24. Ghahramani, Z., Hinton, G.E., et al.: The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto (1996)

  25. Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995)

  26. Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)

  27. Ihaka, R., Gentleman, R.: R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5(3), 299–314 (1996). https://doi.org/10.1080/10618600.1996.10474713

  28. Kim, J.O., Mueller, C.W.: Factor Analysis: Statistical Methods and Practical Issues, vol. 14. Sage, Thousand Oaks (1978)

  29. Ledermann, W.: On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika 2(2), 85–93 (1937)

  30. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 15 Sept 2018

  31. Malsiner Walli, G., Frühwirth-Schnatter, S., Grün, B.: Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303–324 (2016)

  32. Malsiner Walli, G., Frühwirth-Schnatter, S., Grün, B.: Identifying mixtures of mixtures using Bayesian estimation. J. Comput. Graph. Stat. 26, 285–295 (2017)

  33. Marin, J., Mengersen, K., Robert, C.: Bayesian modelling and inference on mixtures of distributions. Handb. Stat. 25(1), 577–590 (2005)

  34. Mavridis, D., Ntzoufras, I.: Stochastic search item selection for factor analytic models. Br. J. Math. Stat. Psychol. 67(2), 284–303 (2014). https://doi.org/10.1111/bmsp.12019

  35. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  36. McNicholas, P.D., ElSherbiny, A., Jampani, R.K., McDaid, A.F., Murphy, B., Banks, L.: pgmm: Parsimonious Gaussian Mixture Models. http://CRAN.R-project.org/package=pgmm, R package version 1.2.3 (2015)

  37. McNicholas, P.D.: Mixture Model-Based Classification. CRC Press, Boca Raton (2016)

  38. McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 285–296 (2008)

  39. McNicholas, P.D., Murphy, T.B.: Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21), 2705 (2010)

  40. McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D.: Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal. 54(3), 711–723 (2010)

  41. McParland, D., Phillips, C.M., Brennan, L., Roche, H.M., Gormley, I.C.: Clustering high-dimensional mixed data to uncover sub-phenotypes: joint analysis of phenotypic and genotypic data. Stat. Med. 36(28), 4548–4569 (2017)

  42. Meng, X.L., Van Dyk, D.: The EM algorithm—an old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 59(3), 511–567 (1997)

  43. Murphy, K., Gormley, I.C., Viroli, C.: Infinite mixtures of infinite factor analysers (2019). arXiv preprint arXiv:1701.07010

  44. Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)

  45. Nobile, A., Fearnside, A.T.: Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat. Comput. 17(2), 147–162 (2007). https://doi.org/10.1007/s11222-006-9014-7

  46. Papastamoulis, P.: fabMix: Overfitting Bayesian mixtures of factor analyzers with parsimonious covariance and unknown number of components (2018a). http://CRAN.R-project.org/package=fabMix, R package version 4.5

  47. Papastamoulis, P.: Handling the label switching problem in latent class models via the ECR algorithm. Commun. Stat. Simul. Comput. 43(4), 913–927 (2014)

  48. Papastamoulis, P.: label.switching: an R package for dealing with the label switching problem in MCMC outputs. J. Stat. Softw. 69(1), 1–24 (2016)

  49. Papastamoulis, P.: Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Comput. Stat. Data Anal. 124, 220–234 (2018b). https://doi.org/10.1016/j.csda.2018.03.007

  50. Papastamoulis, P., Iliopoulos, G.: Reversible jump MCMC in mixtures of normal distributions with the same component means. Comput. Stat. Data Anal. 53(4), 900–911 (2009)

  51. Papastamoulis, P., Iliopoulos, G.: An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. J. Comput. Graph. Stat. 19, 313–331 (2010)

  52. Papastamoulis, P., Iliopoulos, G.: On the convergence rate of random permutation sampler and ECR algorithm in missing data models. Methodol. Comput. Appl. Probab. 15(2), 293–304 (2013). https://doi.org/10.1007/s11009-011-9238-7

  53. Papastamoulis, P., Rattray, M.: BayesBinMix: an R package for model based clustering of multivariate binary data. R J. 9(1), 403–420 (2017)

  54. Plummer, M., Best, N., Cowles, K., Vines, K.: CODA: convergence diagnosis and output analysis for MCMC. R News 6(1), 7–11 (2006)

  55. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/, ISBN 3-900051-07-0

  56. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  57. Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26(2), 195–239 (1984)

  58. Revolution Analytics, Weston, S.: foreach: Foreach looping construct for R. http://CRAN.R-project.org/package=foreach, R package version 1.4.2 (2014)

  59. Revolution Analytics, Weston, S.: doParallel: Foreach Parallel Adaptor for the ’parallel’ Package. http://CRAN.R-project.org/package=doParallel, R package version 1.0.10 (2015)

  60. Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 59(4), 731–758 (1997)

  61. Rousseau, J., Mengersen, K.: Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73(5), 689–710 (2011)

  62. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  63. Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E.: mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8(1), 205–233 (2017)

  64. Stephens, M.: Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods. Ann. Stat. 28(1), 40–74 (2000)

  65. Streuli, H.: Der heutige Stand der Kaffeechemie. In: 6th International Colloquium on Coffee Chemistry, Association Scientifique International du Cafe, Bogotá, Colombia, pp. 61–72 (1973)

  66. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999)

  67. van Havre, Z., White, N., Rousseau, J., Mengersen, K.: Overfitting Bayesian mixture models with an unknown number of components. PLoS ONE 10(7), 1–27 (2015)

  68. Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E., Ruzzo, W.L.: Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10), 977–987 (2001). https://doi.org/10.1093/bioinformatics/17.10.977

Acknowledgements

The author would like to acknowledge the assistance given by IT services and use of the Computational Shared Facility of the University of Manchester. The suggestions of two anonymous reviewers helped to improve the findings of this study.

Author information

Corresponding author

Correspondence to Panagiotis Papastamoulis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Overfitted mixture model

Assume that the observed data have been generated from a mixture model with \(K_0\) components

$$\begin{aligned} f_{K_0}({\varvec{x}}) = \sum _{k=1}^{K_0}w_kf_k({\varvec{x}}|{\varvec{\theta }}_k), \end{aligned}$$

where \(f_k\in {\mathcal {F}}_{{\varTheta }}=\{f(\cdot |{\varvec{\theta }}):{\varvec{\theta }}\in {\varTheta }\}\); \(k = 1,\ldots ,K_0\), denotes a member of a parametric family of distributions. Suppose that an overfitted mixture model \(f_{K}({\varvec{x}})\) with \(K > K_0\) components is fitted to the data. Rousseau and Mengersen (2011) showed that the asymptotic behavior of the posterior distribution of the \(K-K_0\) redundant components depends on the prior distribution of the mixing proportions \(({\varvec{w}})\). Let d denote the number of free parameters of the distribution \(f_k\). For the case of a Dirichlet prior distribution,

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \gamma _1,\ldots ,\gamma _K\right) \end{aligned}$$
(A.32)

if

$$\begin{aligned} \max \{\gamma _k; k=1,\ldots ,K\} < d/2 \end{aligned}$$

then the posterior weight of the extra components converges to zero (Theorem 1 of Rousseau and Mengersen 2011).

Let \(f_K({\varvec{\theta }},{\varvec{z}}|{\varvec{x}})\) denote the joint posterior distribution of model parameters and latent allocation variables for a model with K components. When using an overfitted mixture model, inference on the number of clusters reduces to (a) choosing a sufficiently large number of mixture components (K), (b) running a typical MCMC sampler to draw samples from the posterior distribution \(f_K({\varvec{\theta }}, {\varvec{z}}|{\varvec{x}})\) and (c) inferring the number of “alive” mixture components. Note that at MCMC iteration \(t = 1,2,\ldots \), step (c) reduces to keeping track of the number of elements in the set \(\varvec{K_0}^{(t)}=\{k=1,\ldots ,K: \sum _{i=1}^{n}I(z_i^{(t)}=k)>0\}\), where \(z_i^{(t)}\) denotes the simulated allocation of observation i at iteration t.
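
For illustration, the following base-R sketch performs the bookkeeping of step (c) at a single MCMC iteration; the object names (z_t, K0_hat_t) are hypothetical and not part of the fabMix interface.

```r
## Count the "alive" components of an overfitted mixture at one MCMC iteration,
## i.e. the components to which at least one observation is currently allocated.
K   <- 20                                            # number of components of the overfitted mixture
set.seed(1)
z_t <- sample(1:K, size = 100, replace = TRUE,       # toy allocation vector z^(t)
              prob = c(rep(0.3, 3), rep(0.1/17, 17)))
alive    <- which(tabulate(z_t, nbins = K) > 0)      # indices of non-empty ("alive") components
K0_hat_t <- length(alive)                            # estimated number of clusters at iteration t
```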

In our case, the number of free parameters in the k-th mixture component is equal to \(d = 2p + pq-\frac{q(q-1)}{2}\). Following Papastamoulis (2018b), we set \(\gamma _1=\cdots =\gamma _K = \frac{\gamma }{K}\); thus, the distribution of mixing proportions in Eq. (A.32) becomes

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \frac{\gamma }{K},\ldots ,\frac{\gamma }{K}\right) \end{aligned}$$
(A.33)

where \(0< \gamma < d/2\) denotes a pre-specified positive constant. Such a value is chosen for two reasons. First, it is smaller than d/2, so the asymptotic results of Rousseau and Mengersen (2011) ensure that the extra components will be emptied as \(n\rightarrow \infty \). Second, this choice can be related to standard practice when using Bayesian nonparametric clustering methods where the parameters of a mixture are drawn from a Dirichlet process (Ferguson 1973), that is, a Dirichlet process mixture model (Neal 2000).
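
The following toy simulation (base R, not part of the fabMix code) illustrates why the prior in Eq. (A.33) is sparse: with parameter \(\gamma /K\) small, draws of the weight vector concentrate most of their mass on a few components, which is consistent with the redundant components being emptied a posteriori.

```r
## Draw mixing proportions from the Dirichlet prior D(gamma/K, ..., gamma/K).
rdirichlet1 <- function(alpha) {                 # one Dirichlet draw via normalized Gamma variates
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
set.seed(1)
K <- 20; gamma <- 1
w <- rdirichlet1(rep(gamma / K, K))
round(sort(w, decreasing = TRUE), 3)             # a handful of weights carry almost all the mass
```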

Details of the MCMC sampler

Data normalization and prior parameters Before running the sampler, the raw data are standardized by applying the z-transformation

$$\begin{aligned} \frac{x_{ir} - {\bar{x}}_{r}}{\sqrt{s^2_r}}, \quad i =1,\ldots ,n;r = 1,\ldots ,p \end{aligned}$$

where \({\bar{x}}_{r} = \frac{\sum _{i=1}^{n}x_{ir}}{n}\) and \(s^{2}_{r}= \frac{1}{n-1}\sum _{i=1}^{n}\left( x_{ir}-{\bar{x}}_{r}\right) ^2\). The main reason for using standardized data is that the sampler mixes better. Furthermore, it is easier to choose prior parameters that do not depend on the observed data, thereby avoiding using the data twice. Otherwise, one could use empirical prior distributions as reported in Fokoué and Titterington (2003); see also Dellaportas and Papageorgiou (2006). For the case of standardized data, the prior parameters are specified in Table 5. Standardized data are also used as input to pgmm.
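
In R, this z-transformation amounts to a call to scale(); a minimal sketch follows, where x is a toy data matrix rather than one of the datasets analysed in the paper.

```r
## Standardize each variable: subtract the sample mean and divide by the
## sample standard deviation (denominator n - 1), as in the formula above.
set.seed(1)
x     <- matrix(rnorm(200 * 30), nrow = 200, ncol = 30)   # toy data with n = 200, p = 30
x_std <- scale(x, center = TRUE, scale = TRUE)
```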

Table 5 Prior parameter specification for the case of standardized data

Prior parallel tempering It is well known that the posterior surface of mixture models can exhibit many local modes (Celeux et al. 2000b; Marin et al. 2005). In such cases, simple MCMC algorithms may become trapped in minor modes and demand a very large number of iterations in order to sufficiently explore the posterior distribution. In order to produce a well-mixing MCMC sample and improve the convergence of our algorithm, we utilize ideas from parallel tempering schemes (Geyer 1991; Geyer and Thompson 1995; Altekar et al. 2004), where different chains run in parallel and are allowed to switch states. Each chain corresponds to a different posterior distribution, and usually each one represents a “heated” version of the target posterior distribution. This is achieved by raising the original target to a power T with \(0\leqslant T \leqslant 1\), which flattens the posterior surface and thus makes it easier to explore with an MCMC sampler.

In the context of overfitting mixture models, van Havre et al. (2015) introduced a prior parallel tempering scheme, which is also applied by Papastamoulis (2018b). Under this approach, each heated chain corresponds to a model with identical likelihood as the original, but with a different prior distribution. Although the prior tempering can be imposed on any subset of parameters, it is only applied to the Dirichlet prior distribution of mixing proportions (van Havre et al. 2015). Let us denote by \(f_i({\varvec{\varphi }}|{\varvec{x}})\) and \(f_i({\varvec{\varphi }})\); \(i=1,\ldots ,J\), the posterior and prior distribution of the i-th chain, respectively. Obviously, \(f_i({\varvec{\varphi }}|{\varvec{x}}) \propto f({\varvec{x}}|{\varvec{\varphi }})f_i({\varvec{\varphi }})\). Let \({\varvec{\varphi }}^{(t)}_i\) denote the state of chain i at iteration t and assume that a swap between chains i and j is proposed. The proposed move is accepted with probability \(\min \{1,A\}\) where

$$\begin{aligned} A= & {} \frac{f_i({\varvec{\varphi }}_j^{(t)}|{\varvec{x}}) f_j({\varvec{\varphi }}_i^{(t)}|{\varvec{x}})}{f_i({\varvec{\varphi }}_i^{(t)} |{\varvec{x}})f_j({\varvec{\varphi }}_j^{(t)}|{\varvec{x}})}= \frac{f_i({\varvec{\varphi }}_j^{(t)})f_j({\varvec{\varphi }}_i^{(t)})}{f_i ({\varvec{\varphi }}_i^{(t)})f_j({\varvec{\varphi }}_j^{(t)})} \nonumber \\= & {} \frac{{\widetilde{f}}_i( w_j^{(t)}){\widetilde{f}}_j( w_i^{(t)})}{{\widetilde{f}}_i( w_i^{(t)}){\widetilde{f}}_j( w_j^{(t)})}, \end{aligned}$$
(B.34)

and \({\widetilde{f}}_i(\cdot )\) corresponds to the probability density function of the Dirichlet prior distribution related to chain \(i = 1,\ldots ,J\). According to Eq. (A.33), this is

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \frac{\gamma _{(j)}}{K},\ldots , \frac{\gamma _{(j)}}{K}\right) , \end{aligned}$$
(B.35)

for a pre-specified set of parameters \(\gamma _{(j)}>0\) for \(j = 1,\ldots ,J\).

In our examples, we used a total of \(J = 4\) parallel chains where the prior distribution of mixing proportions for chain j in Eq. (B.35) is selected as

$$\begin{aligned} \gamma _{(j)} = \gamma + \delta (j-1), \quad j = 1,\ldots ,J, \end{aligned}$$

where \(\delta > 0\). For example, when the overfitting mixture model uses \(K=20\) components and \(\gamma = 1\) (the default value shown in Table 5), it follows from Eq. (A.33) that the parameter vector of the Dirichlet prior of mixture weights which corresponds to the target posterior distribution (\(j = 1\)) is equal to \((0.05,\ldots ,0.05)\). In our examples we also used \(\delta =1\), but in general we strongly suggest tuning this parameter until a reasonable acceptance rate is achieved. Each chain runs in parallel, and every 10 iterations we randomly select two adjacent chains \((j,j+1)\), \(j\in \{1,\ldots ,J-1\}\), and propose to swap their current states. A proposed swap is accepted with probability A in Eq. (B.34).
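
A minimal sketch of the swap step follows, assuming (as in Eq. (B.34)) that the tempering only affects the Dirichlet prior of the mixing proportions; all object names and the toy weight vectors are illustrative and do not correspond to the internal implementation of fabMix.

```r
## Log-density of the Dirichlet distribution with parameter vector alpha, evaluated at w.
log_ddirichlet <- function(w, alpha) {
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(w))
}
K <- 20; J <- 4; gamma <- 1; delta <- 1
alpha_chain <- (gamma + delta * (0:(J - 1))) / K          # per-chain Dirichlet parameters gamma_(j)/K

## Current mixing proportions of two adjacent chains j = 1 and j = 2 (toy values).
w_1 <- c(rep(0.18, 5), rep(0.1 / 15, 15))
w_2 <- rep(1 / K, K)

## Log of the acceptance ratio A in Eq. (B.34): prior densities evaluated at the swapped states.
logA <- log_ddirichlet(w_2, rep(alpha_chain[1], K)) + log_ddirichlet(w_1, rep(alpha_chain[2], K)) -
        log_ddirichlet(w_1, rep(alpha_chain[1], K)) - log_ddirichlet(w_2, rep(alpha_chain[2], K))
accept <- log(runif(1)) < min(0, logA)                    # accept the swap with probability min{1, A}
```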

“Overfitting initialization” strategy We briefly describe the “overfitting initialization” procedure introduced by Papastamoulis (2018b). We used an initial period of 500 MCMC iterations where each chain is initialized from totally random starting values, but under a Dirichlet prior distribution with large prior parameter values. These values were chosen so that the asymptotic results of Rousseau and Mengersen (2011) guarantee that the redundant mixture components have non-negligible posterior weights. More specifically, for chain j we assume \( w\sim {\mathcal {D}}(\gamma '_{(j)},\ldots ,\gamma '_{(j)})\) with \(\gamma '_{(j)} = \frac{d}{2} + (j-1)\frac{d}{2(J-1)}\), for \(j =1,\ldots ,J\). The actual model is then initialized from this state. According to Papastamoulis (2018b), this specific scheme was found to outperform other initialization procedures.
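
For instance, the warm-up Dirichlet parameters can be computed as below; the values p = 30, q = 2 and J = 4 are illustrative (matching scenario 1 of the simulations and the number of chains used in the examples).

```r
## Dirichlet parameters gamma'_(j) used during the "overfitting initialization" warm-up.
p <- 30; q <- 2; J <- 4
d          <- 2 * p + p * q - q * (q - 1) / 2              # free parameters per component
gamma_warm <- d / 2 + (0:(J - 1)) * d / (2 * (J - 1))      # gamma'_(j), j = 1, ..., J
gamma_warm                                                 # all values are at least d/2, so redundant components are not emptied
```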

Table 6 Setup for our simulation scenarios with unequal cluster sizes. In all cases, the dimensionality of the multivariate data is equal to \(p=30\)
Fig. 8

Examples of simulated datasets with unequal cluster sizes according to scenarios 1 and 2 and \(n = 200\). The legend shows the “true” cluster sizes

Additional simulations

In the simulation section of the manuscript, the weights of the simulated datasets were randomly generated from a Dirichlet distribution with mean equal to 1/K, conditional on the number of clusters (K). Thus, on average, the true cluster sizes are equal. In this section, we examine the performance of the proposed method in the presence of unequal cluster sizes with respect to the sample size (n) of the observed data.

We replicate the simulation mechanism for scenarios 1 and 2 presented in the main text, but now we consider unequal (true) cluster sizes, as detailed in Table 6. For each case, the sample size increases (as shown in the last column of Table 6) while keeping all other parameters (that is, the true values of marginal means and factor loadings) constant. As shown in Table 6, in scenario 1 there are five clusters and two factors, whereas in scenario 2 there are two clusters and three factors. In total, three different examples per scenario are considered: for a given scenario, the component-specific parameters are different in each example, but the weights are the same. An instance of our three examples (per scenario) using \(n = 200\) simulated observations is shown in Fig. 8. Observe that in all cases the “true clusters” are not easily distinguishable, especially in scenario 2 where there is a high degree of cluster overlap.
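
For concreteness, the sketch below generates data from a mixture of factor analyzers with unequal cluster weights, i.e. \(x_i = \mu _{z_i} + \varLambda _{z_i} y_i + \varepsilon _i\) with \(y_i\sim N(0,I_q)\) and diagonal error variance. The specific parameter values (weights, means, loadings, error scale) are illustrative only and do not reproduce the exact settings of Table 6.

```r
## Toy generator for a mixture of factor analyzers with unequal cluster sizes.
set.seed(1)
n <- 200; p <- 30; q <- 2; K <- 5
w_true <- c(0.40, 0.25, 0.15, 0.12, 0.08)                 # unequal mixing proportions
z      <- sample(1:K, n, replace = TRUE, prob = w_true)   # "true" cluster labels
mu     <- matrix(rnorm(K * p, sd = 3), nrow = K)          # component-specific marginal means
Lambda <- array(rnorm(K * p * q), dim = c(K, p, q))       # component-specific factor loadings
x <- t(sapply(1:n, function(i) {
  k <- z[i]
  mu[k, ] + Lambda[k, , ] %*% rnorm(q) + rnorm(p, sd = 0.5)
}))
table(z)                                                  # realized cluster sizes
```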

Fig. 9

Adjusted Rand index (first row), estimated number of clusters (second row) and estimated number of factors (third row) for simulated data according to scenario 1 with unequal cluster sizes and increasing sample size. The dotted line in the first row corresponds to the adjusted Rand index between the ground truth and the clustering of the data when applying the Maximum A Posteriori rule using the parameter values that generated the data (C.36). For all examples, the true number of clusters and factors is equal to 5 and 2, respectively

Fig. 10

Adjusted Rand index (first row), estimated number of clusters (second row) and estimated number of factors (third row) for simulated data according to scenario 2 with unequal cluster sizes and increasing sample size. The dotted line in the first row corresponds to the adjusted Rand index between the ground truth and the clustering of the data when applying the Maximum A Posteriori rule using the parameter values that generated the data (C.36). For all examples, the true number of clusters and factors is equal to 2 and 3, respectively

We applied fabMix and pgmm using the same number (4) of parallel chains (for fabMix) and different starts (for pgmm) as in the simulations presented in the main paper. The results are summarized in Figs. 9 and 10 for scenarios 1 and 2, respectively. The adjusted Rand index is displayed in the first row of each figure, where the horizontal axis denotes the sample size (n) of the synthetic data. The dotted black line corresponds to the adjusted Rand index between the ground truth and the cluster assignments arising when applying the Maximum A Posteriori rule using the true parameter values that generated the data, that is,

$$\begin{aligned} {z}_i=\arg \max _{k\in \{1,\ldots ,K^*\}}\left\{ \frac{w_k^*f_k ({\varvec{x}}_i|\theta _k^*)}{\sum _{j=1}^{K^*}w_j^*f_j ({\varvec{x}}_i|\theta _j^*)}\right\} ,\quad i=1,\ldots ,n, \end{aligned}$$
(C.36)

where \(K^*\), \((w_1^*,\ldots ,w^*_{K^*})\) and \((\theta _1^*,\ldots ,\theta ^*_{K^*})\) denote the number of components, mixing proportions and parameters of the multivariate normal densities of the mixture model used to generate the data. Observe that in all three examples of scenario 1 the dotted black line is always equal to 1, but this is not the case in the more challenging scenario 2 due to the higher degree of cluster overlap.
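
A sketch of the rule in Eq. (C.36) is given below, using the mvtnorm package for the multivariate normal densities; the mixture parameters are arbitrary illustrative values, not those of the simulation scenarios.

```r
## Classify observations with the Maximum A Posteriori rule, given known mixture parameters.
library(mvtnorm)
set.seed(1)
p <- 2; K_star <- 2
w_star     <- c(0.6, 0.4)                                 # true mixing proportions
mu_star    <- list(c(0, 0), c(3, 3))                      # true component means
Sigma_star <- list(diag(p), diag(p))                      # true component covariance matrices
x <- rbind(rmvnorm(60, mu_star[[1]], Sigma_star[[1]]),    # data generated from the mixture
           rmvnorm(40, mu_star[[2]], Sigma_star[[2]]))
num   <- sapply(1:K_star, function(k) w_star[k] * dmvnorm(x, mu_star[[k]], Sigma_star[[k]]))
z_map <- apply(num / rowSums(num), 1, which.max)          # arg max of the posterior class probabilities
table(z_map)
```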

The adjusted Rand index between the ground-truth clustering and the estimated cluster assignments arising from fabMix and pgmm is shown in the first row of Figs. 9 and 10. Clearly, the compared methods have similar performance as the sample size increases, but for smaller values of n the proposed method outperforms the pgmm package.

The estimated number of clusters, shown in the second row of Figs. 9 and 10, agrees (in most cases) with the true number of clusters, but note that our method is capable of detecting the right value at smaller sample sizes than pgmm. Two exceptions occur at \(n=200\) for example 2 of scenario 1, where fabMix (red line in the second row of Fig. 9) inferred 6 alive clusters instead of 5, as well as at \(n=800\) for example 1 of scenario 2, where fabMix (red line in the second row of Fig. 10) inferred 3 alive clusters instead of 2.

Finally, the last row in Figs. 9 and 10 displays the inferred number of factors for scenarios 1 and 2, respectively. In every case, the estimate arising from fabMix is at least as close to the corresponding true value as the estimate arising from pgmm. Note, however, that in example 1 of scenario 2 both methods detect a smaller number of factors (2 instead of 3). In all other cases, we observe that as the sample size increases both methods infer the “true” number of factors.

Cite this article

Papastamoulis, P. Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Stat Comput 30, 485–506 (2020). https://doi.org/10.1007/s11222-019-09891-z

Keywords

  • Mixture model
  • Factor analysis
  • MCMC
  • R package