Abstract
Mixture models with (multivariate) Gaussian components are a popular tool in model-based clustering. Such models are often fitted by maximum likelihood, typically via the EM algorithm. At convergence, the maximum likelihood parameter estimates are reported, but in most cases little emphasis is placed on the variability associated with these estimates. In part this may be because standard errors are not directly calculated by the model-fitting algorithm, either because they are not required to fit the model or because they are difficult to compute. The examination of standard errors in model-based clustering is therefore typically neglected. Sampling-based methods, such as the jackknife (JK), bootstrap (BS) and parametric bootstrap (PB), are intuitive, generalizable approaches to assessing parameter uncertainty in model-based clustering using a Gaussian mixture model. This paper provides a review and empirical comparison of the jackknife, bootstrap and parametric bootstrap methods for producing standard errors and confidence intervals for mixture parameters. However, the performance of these sampling methods in the presence of small and/or overlapping clusters requires consideration; here the weighted likelihood bootstrap (WLBS) approach is demonstrated to be effective in addressing this concern in a model-based clustering framework. The JK, BS, PB and WLBS methods are illustrated and contrasted through simulation studies and through the well-known Old Faithful and Thyroid data sets. The MclustBootstrap function, available in the most recent release of the popular R package mclust, facilitates the implementation of the JK, BS, PB and WLBS approaches to estimating parameter uncertainty in the context of model-based clustering.
The JK, WLBS and PB approaches to variance estimation are shown to be robust and to provide good coverage across a range of real and simulated data sets when performing model-based clustering, but care is advised when using the BS in such settings. When model fit is poor (for example, for data with small and/or overlapping clusters), the JK and BS suffer because the specified model cannot be fitted in many of the sub-samples formed. The PB also suffers when model fit is poor, since it relies on data sets simulated from the fitted model as the basis for the variance estimation calculations. However, the WLBS generally provides a robust solution, because all observations are represented with some weight in each of the sub-samples formed under this approach.
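The mechanism behind this robustness can be illustrated with a short sketch (illustrative only, not taken from the paper's code): rather than resampling observations with replacement, each WLBS replicate retains every observation and assigns it a random positive weight, for example uniform Dirichlet weights obtained by standardising independent exponential draws as in Newton and Raftery (1994).

```r
## Illustrative WLBS weight generation (an assumption-based sketch):
## each replicate keeps all n observations, each with a positive weight.
set.seed(1)
n <- 10
w <- rexp(n)          # independent Exp(1) draws
w <- n * w / sum(w)   # standardise: Dirichlet weights rescaled to average 1
w                     # every entry is strictly positive
all(w > 0)            # TRUE: no observation is dropped, unlike the BS/JK
```

By contrast, a standard bootstrap replicate assigns integer multinomial weights, so a substantial fraction of observations (about 37% for large n) receive weight zero; a small cluster can then vanish entirely from a replicate, which is what drives the model-fitting failures noted above.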
This work is supported by the Insight Research Centre (SFI/12/RC/2289) and Science Foundation Ireland under the Research Frontiers Programme (2007/RFP/MATH281).
Appendices
Appendix A: Pairs plots of a simulated data set from Simulation Setting Three
Simulation Setting Three explores the performance and computational features of the JK, BS, PB and WLBS approaches to parameter variance estimation in a higher dimensional setting featuring overlapping and small clusters. Figures 7, 8 and 9 provide pairs plots from a single simulated data set under this setting for which \(n = 500\), \(p = 25\) and \(G = 5\). Each of the different colours/symbols in the plots denotes one of the 5 distinct clusters of observations simulated.
Appendix B: Covariance parameter estimates and standard errors for the Thyroid data
Cluster covariance parameter estimates (with associated standard errors) are presented below, obtained using the jackknife (JK), bootstrap (BS), parametric bootstrap (PB) and weighted likelihood bootstrap (WLBS) methods, for the optimal mixture of Gaussians model fitted to groups 1, 2 and 3 of the Thyroid data in turn, where \(G = 3\) and \(p = 5\) and the optimal model has unequal diagonal covariance structure across clusters.
The following code produces all variance estimation results for the Thyroid data set, using the MclustBootstrap function in mclust.
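The script itself is not reproduced in this excerpt; the following is a minimal sketch of the workflow, assuming the thyroid data set distributed with mclust (the column selection, nboot value and modelNames argument are illustrative assumptions, not the paper's exact script).

```r
## Sketch of the MclustBootstrap workflow for the Thyroid data
## (assumes the thyroid data set shipped with mclust; nboot and
## modelNames are illustrative choices).
library(mclust)

data(thyroid)
X <- thyroid[, -1]   # drop the Diagnosis label, keep the 5 measurements

## Fit the mixture of Gaussians; the text reports G = 3 with unequal
## diagonal covariance structure ("VVI" in mclust's nomenclature)
mod <- Mclust(X, G = 3, modelNames = "VVI")

## Resampling-based variance estimation for each approach
boot.bs   <- MclustBootstrap(mod, nboot = 999, type = "bs")    # bootstrap
boot.wlbs <- MclustBootstrap(mod, nboot = 999, type = "wlbs")  # weighted likelihood bootstrap
boot.pb   <- MclustBootstrap(mod, nboot = 999, type = "pb")    # parametric bootstrap
boot.jk   <- MclustBootstrap(mod, type = "jk")                 # jackknife: one fit per left-out case

## Standard errors and percentile confidence intervals
summary(boot.wlbs, what = "se")
summary(boot.wlbs, what = "ci")
```

The same summary calls apply to each of the four fitted objects, yielding the standard errors and intervals reported in the tables above.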
O’Hagan, A., Murphy, T.B., Scrucca, L. et al. Investigation of parameter uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and weighted likelihood bootstrap. Comput Stat 34, 1779–1813 (2019). https://doi.org/10.1007/s00180-019-00897-9