Abstract
Mixture model-based clustering has become an increasingly popular data analysis technique since its introduction over fifty years ago, and is now commonly utilized within a family setting. Families of mixture models arise when the component parameters, usually the component covariance (or scale) matrices, are decomposed and a number of constraints are imposed. Within the family setting, model selection involves choosing the member of the family, i.e., the appropriate covariance structure, in addition to the number of mixture components. To date, the Bayesian information criterion (BIC) has proved most effective for model selection, and the expectation-maximization (EM) algorithm is usually used for parameter estimation. In fact, this EM-BIC rubric has virtually monopolized the literature on families of mixture models. Deviating from this rubric, variational Bayes approximations are developed for parameter estimation and the deviance information criterion (DIC) is used for model selection. The variational Bayes approach provides an alternative framework for parameter estimation by constructing a tight lower bound on the complex marginal likelihood and maximizing this lower bound via minimization of the associated Kullback-Leibler divergence. The framework introduced, which we refer to as VB-DIC, is applied to the most commonly used family of Gaussian mixture models, and real and simulated data are used to compare it with the EM-BIC rubric.
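The lower-bound construction described in the abstract can be checked numerically. The following sketch (not from the paper; all names and values are illustrative) builds a toy two-component mixture with a single latent label and verifies the identity log p(x) = ELBO(q) + KL(q ‖ p(z|x)), which is why maximizing the lower bound is equivalent to minimizing the Kullback-Leibler divergence.

```python
import numpy as np

# Toy setting: one observation x from a two-component mixture with known
# component densities; the only latent variable is the component label z.
prior = np.array([0.6, 0.4])        # p(z)
lik = np.array([0.5, 2.0])          # p(x | z) evaluated at the fixed x
evidence = np.sum(prior * lik)      # p(x), the marginal likelihood
posterior = prior * lik / evidence  # p(z | x)

def elbo(q):
    """Evidence lower bound: E_q[log p(x, z)] - E_q[log q(z)]."""
    return np.sum(q * (np.log(prior * lik) - np.log(q)))

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return np.sum(q * np.log(q / p))

q = np.array([0.5, 0.5])  # an arbitrary variational distribution

# Identity: log p(x) = ELBO(q) + KL(q || posterior); the bound is tight
# exactly when q equals the true posterior.
assert np.isclose(np.log(evidence), elbo(q) + kl(q, posterior))
assert np.isclose(elbo(posterior), np.log(evidence))
```

Because KL(q ‖ p(z|x)) is nonnegative, the ELBO never exceeds the log marginal likelihood, and driving the KL term to zero recovers the exact posterior.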
References
Aitken, A.C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45, 14–22.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (6), 716–723.
Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49 (3), 803–821.
Bensmail, H., Celeux, G., Raftery, A.E., Robert, C.P. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1–10.
Biernacki, C., Celeux, G., Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (7), 719–725.
Biernacki, C., & Lourme, A. (2019). Unifying data units and models in (co-)clustering. Advances in Data Analysis and Classification, 13 (1), 7–31.
Bingham, C. (1974). An antipodally symmetric distribution on the sphere. The Annals of Statistics, 2 (6), 1201–1225.
Blei, D.M., Kucukelbir, A., McAuliffe, J.D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association, 112 (518), 859–877.
Bock, H.H. (1996). Probabilistic models in cluster analysis. Computational Statistics and Data Analysis, 23, 5–28.
Bock, H.H. (1998a). Data science, classification and related methods (pp. 3–21). New York: Springer-Verlag.
Bock, H.H. (1998b). Probabilistic approaches in cluster analysis. Bulletin of the International Statistical Institute, 57, 603–606.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics, 46, 373–388.
Boulesteix, A.-L., Durif, G., Lambert-Lacroix, S., Peyre, J., Strimmer, K. (2018). plsgenomics: PLS Analyses for Genomics. R package version 1.5-2.
Bouveyron, C., & Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: a review. Computational Statistics and Data Analysis, 71, 52–78.
Browne, R.P., & McNicholas, P.D. (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification, 8 (2), 217–226.
Casella, G., Mengersen, K., Robert, C., Titterington, D. (2002). Perfect samplers for mixtures of distributions. Journal of the Royal Statistical Society: Series B, 64, 777–790.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28, 781–793.
Celeux, G., Hurn, M., Robert, C. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95, 957–970.
Cheam, A.S.M., Marbac, M., McNicholas, P.D. (2017). Model-based clustering for spatiotemporal data on air quality monitoring. Environmetrics, 93, 192–206.
Corduneanu, A., & Bishop, C. (2001). Variational Bayesian model selection for mixture distributions. In Artificial intelligence and statistics (pp. 27–34). Los Altos: Morgan Kaufmann.
Dang, U.J., Browne, R.P., McNicholas, P.D. (2015). Mixtures of multivariate power exponential distributions. Biometrics, 71 (4), 1081–1089.
Dang, U.J., Punzo, A., McNicholas, P.D., Ingrassia, S., Browne, R.P. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34 (1), 4–34.
Day, N.E. (1969). Estimating the components of a mixture of normal distributions. Biometrika, 56 (3), 463–474.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39 (1), 1–38.
Diebolt, J., & Robert, C. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B, 56, 363–375.
Fraley, C., & Raftery, A.E. (2007). Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24, 155–181.
Franczak, B.C., Browne, R.P., McNicholas, P.D. (2014). Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (6), 1149–1157.
Gallaugher, M.P.B., & McNicholas, P.D. (2018a). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80, 83–93.
Gallaugher, M.P.B., & McNicholas, P.D. (2018b). A mixture of matrix variate bilinear factor analyzers. In: Proceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical Association. Also available as arXiv preprint. arXiv:1712.08664v3.
Gallaugher, M.P.B., & McNicholas, P.D. (2019a). Mixtures of skewed matrix variate bilinear factor analyzers. Advances in Data Analysis and Classification. To appear. https://doi.org/10.1007/s11634-019-00377-4.
Gallaugher, M.P.B., & McNicholas, P.D. (2019b). On fractionally-supervised classification: weight selection and extension to the multivariate t-distribution. Journal of Classification, 36 (2), 232–265.
Gelman, A., Stern, H.S., Carlin, J.B., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian data analysis. Boca Raton: Chapman and Hall/CRC Press.
Gupta, A., & Nagar, D. (2000). Matrix variate distributions. Boca Raton: Chapman & Hall/CRC Press.
Hartigan, J.A., & Wong, M.A. (1979). A k-means clustering algorithm. Applied Statistics, 28 (1), 100–108.
Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8 (3), 431–444.
Hoff, P. (2012). rstiefel: random orthonormal matrix generation on the Stiefel manifold. R package version 0.9.
Hoff, P.D. (2009). Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18 (2), 438–456.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Jasra, A., Holmes, C.C., Stephens, D.A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20 (1), 50–67.
Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233.
Lee, S., & McLachlan, G.J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing, 24, 181–202.
Lee, S.X., & McLachlan, G.J. (2016). Finite mixtures of canonical fundamental skew t-distributions – the unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing, 26 (3), 573–589.
Lin, T., McLachlan, G.J., Lee, S.X. (2016). Extending mixtures of factor models using the restricted multivariate skew-normal distribution. Journal of Multivariate Analysis, 143, 398–413.
Lin, T. -I., McNicholas, P. D., Hsiu, J. H. (2014). Capturing patterns via parsimonious t mixture models. Statistics and Probability Letters, 88, 80–87.
MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
McGrory, C., & Titterington, D. (2007). Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics and Data Analysis, 51, 5352–5367.
McGrory, C., & Titterington, D. (2009). Variational Bayesian analysis for hidden Markov models. Australian and New Zealand Journal of Statistics, 51, 227–244.
McGrory, C., Titterington, D., Pettitt, A. (2009). Variational Bayes for estimating the parameters of a hidden Potts model. Statistics and Computing, 19 (3), 329–340.
McLachlan, G.J., & Krishnan, T. (2008). The EM algorithm and extensions, 2nd edn. New York: Wiley.
McNicholas, P.D. (2010). Model-based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference, 140 (5), 1175–1181.
McNicholas, P.D. (2016a). Mixture model-based classification. Boca Raton: Chapman & Hall/CRC Press.
McNicholas, P.D. (2016b). Model-based clustering. Journal of Classification, 33 (3), 331–373.
McNicholas, P.D., & Murphy, T.B. (2008). Parsimonious Gaussian mixture models. Statistics and Computing, 18, 285–296.
McNicholas, P.D., & Murphy, T.B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26 (21), 2705–2712.
Melnykov, V., & Zhu, X. (2018). On model-based clustering of skewed matrix data. Journal of Multivariate Analysis, 167, 181–194.
Morris, K., & McNicholas, P.D. (2016). Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures. Computational Statistics and Data Analysis, 97, 133–150.
Morris, K., Punzo, A., McNicholas, P.D., Browne, R.P. (2019). Asymmetric clusters and outliers: mixtures of multivariate contaminated shifted asymmetric Laplace distributions. Computational Statistics and Data Analysis, 132, 145–166.
Murray, P.M., Browne, R.B., McNicholas, P.D. (2014a). Mixtures of skew-t factor analyzers. Computational Statistics and Data Analysis, 77, 326–335.
Murray, P.M., Browne, R.P., McNicholas, P.D. (2019). Mixtures of hidden truncation hyperbolic factor analyzers. Journal of Classification. To appear. https://doi.org/10.1007/s00357-019-9309-y.
Murray, P.M., McNicholas, P.D., Browne, R.P. (2014b). A mixture of common skew-t factor analyzers. Stat, 3 (1), 68–82.
Neath, R.C. (2013). On convergence properties of the Monte Carlo EM algorithm. In Advances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton (pp. 43–62). Institute of Mathematical Statistics.
O’Hagan, A., Murphy, T.B., Gormley, I.C., McNicholas, P.D., Karlis, D. (2016). Clustering with the multivariate normal inverse Gaussian distribution. Computational Statistics and Data Analysis, 93, 18–30.
Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A, 185, 71–110.
Punzo, A., Blostein, M., McNicholas, P.D. (2020). High-dimensional unsupervised classification via parsimonious contaminated mixtures. Pattern Recognition, 98, 107031.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.
Richardson, S., & Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B, 59, 731–792.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6 (2), 461–464.
Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8 (1), 205–233.
Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society: Series B, 64, 583–639.
Stephens, M. (1997). Bayesian methods for mixtures of normal distributions. Ph.D. thesis, University of Oxford.
Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components — an alternative to reversible jump methods. The Annals of Statistics, 28, 40–74.
Subedi, S., & McNicholas, P.D. (2014). Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Advances in Data Analysis and Classification, 8 (2), 167–193.
Subedi, S., Punzo, A., Ingrassia, S., McNicholas, P.D. (2015). Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods and Applications, 24 (4), 623–649.
Titterington, D.M., Smith, A.F.M., Makov, U.E. (1985). Statistical analysis of finite mixture distributions. Chichester: John Wiley & Sons.
Tortora, C., Franczak, B.C., Browne, R.P., McNicholas, P.D. (2019). A mixture of coalesced generalized hyperbolic distributions. Journal of Classification, 36 (1), 26–57.
Ueda, N., & Ghahramani, Z. (2002). Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15, 1223–1241.
Venables, W.N., & Ripley, B.D. (2002). Modern applied statistics with S, 4th edn. New York: Springer.
Viroli, C. (2011). Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21 (4), 511–522.
Vrbik, I., & McNicholas, P.D. (2014). Parsimonious skew mixture models for model-based clustering and classification. Computational Statistics and Data Analysis, 71, 196–210.
Vrbik, I., & McNicholas, P.D. (2015). Fractionally-supervised classification. Journal of Classification, 32 (3), 359–381.
Wang, X., He, C.Z., Sun, D. (2005). Bayesian inference on the patient population size given list mismatches. Statistics in Medicine, 24 (2), 249–267.
Wolfe, J.H. (1965). A computer program for the maximum likelihood analysis of types. Technical Bulletin 65–15, U.S. Naval Personnel Research Activity.
Zhu, X., & Melnykov, V. (2018). Manly transformation in finite mixture modeling. Computational Statistics & Data Analysis, 121, 190–208.
Funding
This work was supported by a Postgraduate Scholarship from the Natural Science and Engineering Research Council of Canada (Subedi); the Canada Research Chairs program (McNicholas); and an E.W.R. Steacie Memorial Fellowship (McNicholas).
Appendices
Appendix A: Posterior distributions for the parameters of the eigen-decomposed covariance matrix
Appendix B: Posterior expected value of the precision parameters of the eigen-decomposed covariance matrix
Appendix C: Mathematical details for the EEV and VEV Models
C.1 EEV Model
The mixing proportions were assigned a Dirichlet prior distribution, such that
\[(\pi_{1},\ldots,\pi_{G})\sim\text{Dirichlet}\left(\alpha_{1}^{(0)},\ldots,\alpha_{G}^{(0)}\right).\]
For the mean, a Gaussian distribution conditional on the covariance matrix was used, such that
\[\boldsymbol{\mu}_{g}\mid\boldsymbol{\Sigma}_{g}\sim N\left(\mathbf{m}_{g}^{(0)},\boldsymbol{\Sigma}_{g}/\beta_{g}^{(0)}\right).\]
For the parameters of the covariance matrix, the following priors were used: the k th diagonal elements of \((\lambda\mathbf{A})^{-1}\) were assigned a Gamma \((a^{(0)}_{k},b^{(0)}_{k})\) distribution and \(\mathbf{D}_{g}\) was assigned a matrix von Mises-Fisher \((\mathbf{C}^{(0)}_{g})\) distribution. By setting \(\boldsymbol{\tau}=(\lambda\mathbf{A})^{-1}\), its prior can be written
\[p(\boldsymbol{\tau})\propto\prod_{k=1}^{d}\tau_{k}^{a_{k}^{(0)}-1}\exp\left\{-b_{k}^{(0)}\tau_{k}\right\},\]
where \(\tau_{k}\) is the k th diagonal element of \(\boldsymbol{\tau}=(\lambda\mathbf{A})^{-1}\).
The matrix D has a density as defined by Gupta and Nagar (2000):
\[p\left(\mathbf{D}\mid\mathbf{A}^{(0)},\mathbf{B}_{g}^{(0)},\mathbf{C}_{g}^{(0)}\right)\propto\operatorname{etr}\left(\mathbf{C}_{g}^{(0)\top}\mathbf{D}+\mathbf{B}_{g}^{(0)}\mathbf{D}^{\top}\mathbf{A}^{(0)}\mathbf{D}\right)[d\mathbf{D}],\]
for D ∈ O(d,d), where O(d,d) is the Stiefel manifold of d × d matrices, [dD] is the unit invariant measure on O(d,d), and \(\mathbf{A}^{(0)}\) and \(\mathbf{B}_{g}^{(0)}\) are symmetric and diagonal matrices, respectively.
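As an illustration of working with this density, the following Python sketch (names and hyperparameter values are illustrative, not from the paper) evaluates the unnormalized log kernel etr(C⊤D + BD⊤AD) of the matrix Bingham-von Mises-Fisher distribution, in the parameterization used by Hoff (2009), at a random orthogonal matrix:

```python
import numpy as np

def log_bmf_kernel(D, A, B, C):
    """Unnormalized log density etr(C^T D + B D^T A D) of the matrix
    Bingham-von Mises-Fisher distribution (cf. Hoff 2009)."""
    return np.trace(C.T @ D + B @ D.T @ A @ D)

d = 3
rng = np.random.default_rng(0)

# A symmetric, B diagonal, as stated in the text; C unrestricted.
M = rng.normal(size=(d, d))
A = (M + M.T) / 2
B = np.diag(rng.normal(size=d))
C = rng.normal(size=(d, d))

# A point on O(d, d): the Q factor of a QR decomposition is orthogonal.
D, _ = np.linalg.qr(rng.normal(size=(d, d)))

val = log_bmf_kernel(D, A, B, C)
```

The Hoff (2012) rstiefel package cited in the references provides samplers for this distribution; the snippet above only evaluates the kernel, which is what the posterior derivations below manipulate.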
The joint distribution of μ1,…,μG, τ, and D is
The likelihood of the data can be written
Therefore, the joint posterior distribution of μ, τ, and D can be written
Thus, the posterior distribution of the mean becomes
where \(\beta_{g} = \beta_{g}^{(0)}+{\sum}_{i=1}^{n}\hat{z}_{ig}\) and \(\mathbf{m}_{g}=\frac{1}{\beta_{g}}\left(\beta_{g}^{(0)}\mathbf{m}_{g}^{(0)}+{\sum}_{i=1}^{n}\hat{z}_{ig}\mathbf{x}_{i}\right)\).
The posterior distribution for the k th diagonal element of \(\boldsymbol{\tau}=(\lambda\mathbf{A})^{-1}\) is
where \(a_{k}=a_{k}^{(0)}+d{\sum }_{g=1}^{G}{\sum }_{i=1}^{n}\hat {z}_{ig}=a_{k}^{(0)}+dn\) and
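The hyperparameter updates quoted above can be sketched in code. In the snippet below, the responsibilities ẑ_ig, the prior values, and the posterior-mean formula for m_g are illustrative assumptions following standard Gaussian-mixture conjugacy, not taken verbatim from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, G = 8, 3, 2
X = rng.normal(size=(n, d))                # observations, one row per x_i
z_hat = rng.dirichlet(np.ones(G), size=n)  # responsibilities; rows sum to 1

beta0 = np.full(G, 1.0)    # prior beta_g^{(0)} (illustrative value)
m0 = np.zeros((G, d))      # prior means m_g^{(0)} (illustrative value)
a0 = np.full(d, 2.0)       # prior shape a_k^{(0)} (illustrative value)

n_g = z_hat.sum(axis=0)    # effective component sizes, sum_i z_ig
beta = beta0 + n_g         # beta_g = beta_g^{(0)} + sum_i z_ig

# Standard conjugate form for the posterior mean (an assumption here):
m = (beta0[:, None] * m0 + z_hat.T @ X) / beta[:, None]

# Shape update from the text: a_k = a_k^{(0)} + d * sum_g sum_i z_ig
#                                 = a_k^{(0)} + d * n.
a = a0 + d * n_g.sum()
```

Each update adds the responsibility-weighted data summaries to the prior hyperparameters, which is the pattern every conjugate update in this appendix follows.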
We have
which has the functional form of a Bingham matrix distribution, i.e., the form
where \(\mathbf {Q}_{g} = \mathbf {Q}_{g}^{(0)}+\boldsymbol {\tau }\) and
C.2 VEV Model
Similarly, the posterior distribution of Dg for the VEV model has the form
which has the functional form of a Bingham matrix distribution, i.e., the form
where \(\mathbf {Q}_{g}=\mathbf {Q}_{g}^{(0)}+\boldsymbol {\tau }_{g}\) and
Subedi, S., McNicholas, P.D. A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting. J Classif 38, 89–108 (2021). https://doi.org/10.1007/s00357-019-09351-3